ISBN (print): 9798350318920; 9798350318937
The challenges of applying self-supervised learning to 3D mesh data include difficulties in explicitly modeling and leveraging geometric topology information and designing appropriate pretext tasks and augmentation methods for irregular mesh topology. In this paper, we propose a novel approach for pre-training models on large-scale, unlabeled datasets using graph masking on a mesh graph composed of faces. Our method, Mesh Graph Masked Autoencoders (MGM-AE), utilizes masked autoencoding to pre-train the model and extract important features from the data. Our pre-trained model outperforms prior state-of-the-art mesh encoders in shape classification and segmentation benchmarks, achieving 90.8% accuracy on ModelNet40 and 78.5 mIoU on ShapeNet. The best performance is obtained when the model is trained and evaluated under different masking ratios. Our approach demonstrates effectiveness in pre-training models on large-scale, unlabeled datasets and its potential for improving performance on downstream tasks.
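As a rough illustration of graph masking on a face graph, the sketch below builds face adjacency from shared edges and hides a random subset of face nodes. The function names, the shared-edge adjacency rule for triangular faces, and the uniform random sampling are assumptions made for illustration; the abstract does not specify how MGM-AE builds its graph or chooses which faces to mask.

```python
import random
from collections import defaultdict
from itertools import combinations


def face_adjacency(faces):
    """Map each triangular face index to the set of faces sharing an edge with it."""
    edge_to_faces = defaultdict(list)
    for face_idx, face in enumerate(faces):
        # For a triangle, every pair of its vertices is an edge.
        for a, b in combinations(sorted(face), 2):
            edge_to_faces[(a, b)].append(face_idx)

    adjacency = {i: set() for i in range(len(faces))}
    for sharing in edge_to_faces.values():
        for i, j in combinations(sharing, 2):
            adjacency[i].add(j)
            adjacency[j].add(i)
    return adjacency


def mask_faces(num_faces, mask_ratio, seed=None):
    """Split face indices into (visible, masked) lists by uniform random sampling."""
    rng = random.Random(seed)
    num_masked = int(num_faces * mask_ratio)
    masked = sorted(rng.sample(range(num_faces), num_masked))
    masked_set = set(masked)
    visible = [i for i in range(num_faces) if i not in masked_set]
    return visible, masked
```

In a masked autoencoder, the encoder would see only the visible faces and the decoder would be trained to reconstruct features of the masked ones; the reconstruction target is not described in the abstract and is left out here.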
ISBN (print): 9798350318920; 9798350318937
Deep neural networks are susceptible to spurious features that correlate strongly with the target. This leads to sub-optimal performance in real-world deployment, where the spurious correlations do not hold, and poses challenges in safety-critical environments such as healthcare. While spurious features can correlate with causal features in myriad ways, we propose a solution for a common manifestation in computer vision, where the background acts as a spurious feature. In contrast to previous works, we do not require a priori knowledge of the different groups in the data induced by the presence or absence of spurious features, nor corresponding access to samples from those groups. We propose a method, Causal Feature Alignment (CFA), that ignores spurious background features by utilizing segmentations on a small subset of the training data. To reduce the annotation burden, we reduce the pixel-wise annotation task of segmentation to a review task of selecting the best mask, utilizing a recently released foundation model and a feature attribution method. We demonstrate our method on a wide range of datasets, including the semi-synthetic ColoredMNIST, WaterBirds, and ImageNet Backgrounds Challenge, and obtain significant gains over state-of-the-art methods.
ISBN (print): 9798350318920; 9798350318937
Controllable image captioning models generate human-like image descriptions while allowing some control over the generated captions. This paper focuses on controlling the caption length, i.e., choosing between a short, concise description and a long, detailed one. Since existing image captioning datasets contain mostly short captions, generating long captions is challenging. To address the shortage of long training examples, we propose to enrich the dataset with varying-length self-generated captions. These, however, might be of varying quality and are thus unsuitable for conventional training. We introduce a novel training strategy that selects the data points to be used at different times during training. Our method dramatically improves length-control abilities while exhibiting SoTA performance in terms of caption quality. Our approach is general and is also shown to be applicable to paragraph generation. Our code is publicly available.
ISBN (print): 9798350370287; 9798350370713
Multimodal person re-identification is gaining popularity in the research community due to its effectiveness compared to its unimodal counterparts. However, the bottleneck for multimodal deep learning is the need for a large volume of multimodal training examples. Data augmentation techniques such as cropping, flipping, and rotation are often employed in the image domain to improve the generalization of deep learning models. Augmenting modalities other than images, such as text, is challenging and requires significant computational resources and external data sources. In this study, we investigate the effectiveness of two computer vision data augmentation techniques, "cutout" and "cutmix", for text augmentation in multimodal person re-identification. Our approach merges these two augmentation strategies into one strategy called "CutMixOut", which randomly removes words or sub-phrases from a sentence (Cutout) and blends parts of two or more sentences to create diverse examples (CutMix), with a certain probability assigned to each operation. This augmentation was implemented at inference time without any prior training. Our results demonstrate that the proposed technique is simple and effective in improving performance on multiple multimodal person re-identification benchmarks.
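The two operations described above can be sketched on whitespace-tokenized text as follows. This is a minimal illustration, not the authors' implementation: the function names, the span-length limit `max_span`, the prefix-plus-suffix form of mixing, and the default probabilities are all assumptions.

```python
import random


def cutout(words, rng, max_span=2):
    """Drop one random contiguous span of at most max_span words."""
    if len(words) < 2:
        return list(words)
    span = rng.randint(1, min(max_span, len(words) - 1))
    start = rng.randint(0, len(words) - span)
    return words[:start] + words[start + span:]


def cutmix(words_a, words_b, rng):
    """Keep a random-length prefix of words_a and append a random-length suffix of words_b."""
    if not words_a or not words_b:
        return list(words_a)
    cut_a = rng.randint(1, len(words_a))
    cut_b = rng.randint(0, len(words_b) - 1)
    return words_a[:cut_a] + words_b[cut_b:]


def cut_mix_out(sentence, other, p_cutout=0.5, p_cutmix=0.5, seed=None):
    """Apply cutout and then cutmix, each with its own (assumed) probability."""
    rng = random.Random(seed)
    words = sentence.split()
    if rng.random() < p_cutout:
        words = cutout(words, rng)
    if rng.random() < p_cutmix:
        words = cutmix(words, other.split(), rng)
    return " ".join(words)
```

Calling `cut_mix_out(caption, other_caption, seed=0)` returns an augmented caption; setting both probabilities to 0 returns the input unchanged, which is convenient for ablations.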
ISBN (print): 9798350318920; 9798350318937
Manually reading and logging gauge data is time-consuming, and the effort grows with the number of gauges. We present a pipeline that automates the reading of analog gauges. We propose a two-stage CNN pipeline that identifies the key structural components of an analog gauge and outputs an angular reading. To facilitate training, we generate a synthetic dataset of realistic analog gauges with their corresponding annotations. To validate our proposal, we collected an additional real-world dataset of 4,813 manually curated images. Compared with state-of-the-art methodologies, our method reduces the average error by 4.55 degrees, a 52% relative improvement. The resources for this project will be made available at: https://***/fuankarion/automatic-gauge-reading.
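To make the notions of an angular reading and an angular error concrete, the sketch below computes a needle angle from two hypothetical keypoints (the gauge center and the needle tip) and the wrap-around error between two angles. The choice of keypoints, the clockwise-from-up convention, and the function names are assumptions; the abstract does not say which structural components the pipeline detects.

```python
import math


def needle_angle_deg(center, tip):
    """Clockwise needle angle in degrees, measured from straight up.

    Points are (x, y) in image coordinates, where y grows downward.
    """
    dx = tip[0] - center[0]
    dy = center[1] - tip[1]  # flip y so that "up" is positive
    return math.degrees(math.atan2(dx, dy)) % 360.0


def angular_error_deg(predicted, target):
    """Smallest absolute difference between two angles, in the range [0, 180]."""
    diff = abs(predicted - target) % 360.0
    return min(diff, 360.0 - diff)
```

The wrap-around in `angular_error_deg` matters near the 0/360 boundary: a prediction of 350 degrees against a target of 10 degrees is off by 20 degrees, not 340.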
ISBN (print): 9798350318920; 9798350318937
Over the past few decades, camera-based applications for traffic monitoring have grown significantly. Governments and local administrations increasingly rely on the data collected from these cameras to enhance road safety and optimize traffic conditions. However, for effective data utilization, it is imperative to ensure accurate and automated calibration of the cameras involved. This paper proposes a novel approach to this challenge that leverages the topological structure of intersections. We propose a framework that generates a set of synthetic intersection viewpoint images from a bird's-eye-view image and models these images as a graph of virtual cameras. Using Graph Neural Networks, we learn the relationships within this graph, thereby facilitating the estimation of a homography matrix. This estimation leverages the neighbourhood representation for any real-world camera and is enhanced by exploiting multiple images instead of a single match. In turn, the homography matrix allows the retrieval of extrinsic calibration parameters. The proposed framework demonstrates superior performance on both synthetic datasets and real-world cameras, setting a new state-of-the-art benchmark.
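For reference, one standard way to recover extrinsic parameters from a plane-induced homography is sketched below. It assumes that the homography H maps metric ground-plane coordinates (Z = 0) to image pixels and that the intrinsic matrix K is known; neither assumption is stated in the abstract, and the paper's own procedure may differ.

```latex
\begin{align}
H = [\,h_1 \;\; h_2 \;\; h_3\,] &\simeq K\,[\,r_1 \;\; r_2 \;\; t\,] \\
\lambda &= \frac{1}{\lVert K^{-1} h_1 \rVert} \\
r_1 = \lambda K^{-1} h_1, \qquad r_2 &= \lambda K^{-1} h_2, \qquad t = \lambda K^{-1} h_3 \\
r_3 = r_1 \times r_2, \qquad R &= [\,r_1 \;\; r_2 \;\; r_3\,]
\end{align}
```

With a noisy estimate of H, the matrix R obtained this way is generally not exactly orthonormal, so it is commonly projected onto the nearest rotation matrix (for example via an SVD), and the sign of the scale factor is chosen so that the ground plane lies in front of the camera.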
ISBN (print): 9798350318920; 9798350318937
The DEtection TRansformer (DETR) opened new possibilities for object detection by modeling it as a translation task: converting image features into object-level representations. Previous works typically add expensive modules to DETR to perform Multi-Object Tracking (MOT), resulting in more complicated architectures. We instead show how DETR can be turned into an MOT model by employing an instance-level contrastive loss, a revised sampling strategy, and a lightweight assignment method. Our training scheme learns object appearances while preserving detection capabilities, with little overhead. Its performance surpasses the previous state-of-the-art by +2.6 mMOTA on the challenging BDD100K dataset and is comparable to existing transformer-based methods on the MOT17 dataset.
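An instance-level contrastive loss can take several forms; the sketch below shows a generic InfoNCE-style variant in plain Python, where embeddings that share a track identity are treated as positives and all others as negatives. The function name, the cosine-similarity logits, and the temperature value are assumptions for illustration and are not taken from the paper.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [a / norm for a in v]


def instance_contrastive_loss(embeddings, ids, temperature=0.1):
    """Mean InfoNCE-style loss over all (anchor, positive) pairs.

    embeddings: list of non-zero feature vectors, one per detected object.
    ids: identity label per embedding; equal labels mark positives.
    """
    z = [_normalize(e) for e in embeddings]
    losses = []
    for i, anchor in enumerate(z):
        others = [j for j in range(len(z)) if j != i]
        positives = [j for j in others if ids[j] == ids[i]]
        if not positives:
            continue  # an anchor without a positive contributes nothing
        logits = {j: _dot(anchor, z[j]) / temperature for j in others}
        log_denominator = math.log(sum(math.exp(v) for v in logits.values()))
        losses.extend(log_denominator - logits[j] for j in positives)
    return sum(losses) / len(losses) if losses else 0.0
```

The loss is small when embeddings of the same identity point in the same direction and embeddings of different identities do not, which is the property a tracker needs to associate detections across frames.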
ISBN (print): 9798350318920; 9798350318937
Vision-Language Models (VLMs) are expected to be capable of reasoning with commonsense knowledge as humans do. One example is that humans can reason about where and when an image was taken based on their knowledge. This makes us wonder whether, based on visual cues, Vision-Language Models that are pre-trained with large-scale image-text resources can match or even surpass human capability in reasoning about times and locations. To address this question, we propose a two-stage RECOGNITION & REASONING probing task, applied to discriminative and generative VLMs, to uncover whether VLMs can recognize time- and location-relevant features and further reason about them. To facilitate the studies, we introduce WikiTiLo, a well-curated image dataset comprising images with rich socio-cultural cues. In extensive evaluation experiments, we find that although VLMs can effectively retain time- and location-relevant features in their visual encoders, they still fail to reason perfectly with context-conditioned visual features. The dataset is available at https://***/gengyuanmax/WikiTiLo.
ISBN (print): 9798350318920; 9798350318937
The Ball-Pivoting Algorithm (BPA) is a notable technique for 3D surface reconstruction from point clouds whose results depend heavily on the ball radius. In practice, determining the optimal radius for BPA often requires iterative experimentation to achieve good reconstruction quality. BPA also entails geometric computations, such as iterative pivoting, that are inherently non-differentiable. In this paper, we tackle the dual challenges of radius selection and non-differentiability in BPA. Inspired by contextual bandits, we propose an innovative approach that learns the optimal radius based on local geometric features within point clouds. We validate our method on the ModelNet10 and ShapeNet datasets, showcasing superior surface reconstruction compared to manual tuning and other classic methods for both low and high point cloud densities. Our code is available at https://github.com/houda-pixel/AutoBPA.
ISBN (print): 9798350318920; 9798350318937
Neural radiance fields, or NeRF, represent a breakthrough in the field of novel view synthesis and 3D modeling of complex scenes from multi-view image collections. Numerous recent works have shown the importance of making NeRF models more robust, by means of regularization, in order to train with possibly inconsistent and/or very sparse data. In this work, we explore how differential geometry can provide elegant regularization tools for robustly training NeRF-like models, which are modified so as to represent continuous and infinitely differentiable functions. In particular, we present a generic framework for regularizing different types of NeRF observations to improve performance in challenging conditions. We also show how the same formalism can be used to natively encourage the regularity of surfaces by means of Gaussian or mean curvatures.