ISBN (print): 9781665487399
Siamese network architectures trained for self-supervised instance recognition can learn powerful visual representations that are useful in various tasks. Many such approaches maximize the similarity between representations of augmented images of the same object. In this paper, we depart from traditional self-supervised learning benchmarks by defining a novel methodology for new challenging tasks such as zero-shot pose estimation. Our goal is to show that common Siamese networks can effectively be trained on frame pairs from video sequences to generate pose-informed representations. Unlike parallel efforts that focus on introducing new image-space operators for data augmentation, we argue that extending the augmentation strategy by using different frames of a video leads to more powerful representations. To show the effectiveness of this approach, we use the Objectron and UCF101 datasets to learn representations and evaluate them on pose estimation, action recognition, and object re-identification. Furthermore, we carefully validate our method against a number of baselines.
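As an illustration of this frame-pair strategy, the sketch below treats two temporally close frames of the same clip as a positive pair; the random-projection `encode`, the `max_gap` parameter, and all names are stand-ins invented for illustration, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for the shared Siamese encoder: a fixed random projection
# followed by L2 normalization. A real branch would be a CNN/ViT backbone.
W = rng.standard_normal((16, 64))

def encode(frame):
    z = W @ frame.ravel()
    return z / (np.linalg.norm(z) + 1e-8)

def frame_pair_loss(video, max_gap=5):
    """Treat two frames of the SAME clip, at most `max_gap` steps apart,
    as an augmented positive pair and return the negative cosine
    similarity of their embeddings (lower = more similar)."""
    t = int(rng.integers(0, len(video) - max_gap))
    dt = int(rng.integers(1, max_gap + 1))
    z1, z2 = encode(video[t]), encode(video[t + dt])
    return -float(z1 @ z2)
```

In practice this loss would be minimized over many sampled pairs, so that the representation becomes stable under the small pose changes between nearby frames.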
ISBN (print): 9781665487399
Multi-camera tracking of vehicles on a city-wide level is a core component of modern traffic monitoring systems. For this task, single-camera tracking failures are the most common causes of errors concerning automatic multi-target multi-camera tracking systems. To address these problems, we propose several modules that aim at improving single-camera tracklets, e.g., appearance-based tracklet splitting, single-camera clustering, and track completion. After these track refinement steps, hierarchical clustering is used to associate the enhanced single-camera tracklets. During this stage, we leverage vehicle re-identification features as well as prior knowledge about the scene's topology. Last, the proposed track completion strategy is adopted for the cross-camera association task to obtain the final multi-camera tracks. Our method proves competitive: with it, we achieved 4th place in Track 1 of the 2022 AI City Challenge.
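The hierarchical association step can be approximated by a simple agglomerative grouping of tracklet re-identification features; the greedy cosine-distance merger below is only a generic sketch of such clustering, and `thresh` is a hypothetical parameter, not a value from the paper:

```python
import numpy as np

def cluster_tracklets(feats, thresh=0.3):
    """Greedy agglomerative grouping of tracklet re-ID features: repeatedly
    merge the pair of clusters whose (summed) feature vectors have the
    smallest cosine distance, until no pair is below `thresh`."""
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    clusters = [[i] for i in range(len(feats))]
    means = [feats[i] for i in range(len(feats))]
    while len(clusters) > 1:
        d_best, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                cos = float(means[a] @ means[b]) / (
                    np.linalg.norm(means[a]) * np.linalg.norm(means[b]))
                if 1.0 - cos < d_best:
                    d_best, pair = 1.0 - cos, (a, b)
        if d_best > thresh:
            break  # remaining clusters are too dissimilar to merge
        a, b = pair
        clusters[a] += clusters[b]
        means[a] = means[a] + means[b]
        del clusters[b], means[b]
    return clusters
```

A full system would additionally constrain merges with the scene topology (camera adjacency, travel-time priors), which the sketch omits.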
ISBN (digital): 9781665487399
ISBN (print): 9781665487399
Domain adaptation for semantic segmentation across datasets consisting of the same categories has seen several recent successes. However, a more general scenario is when the source and target datasets correspond to non-overlapping label spaces. For example, categories in segmentation datasets change vastly depending on the type of environment or application, yet share many valuable semantic relations. Existing approaches based on feature alignment or discrepancy minimization do not take such category shift into account. In this work, we present Cluster-to-Adapt (C2A), a computationally efficient clustering-based approach for domain adaptation across segmentation datasets with completely different, but possibly related categories. We show that such a clustering objective enforced in a transformed feature space serves to automatically select categories across source and target domains that can be aligned for improving the target performance, while preventing negative transfer for unrelated categories. We demonstrate the effectiveness of our approach through experiments on the challenging problem of outdoor to indoor adaptation for semantic segmentation in few-shot as well as zero-shot settings, with consistent improvements in performance over existing approaches and baselines in all cases.
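A minimal sketch of selecting alignable categories across non-overlapping label spaces, under the assumption that each source category is summarized by a feature prototype; `max_dist` and both function names are illustrative, not part of C2A:

```python
import numpy as np

def select_alignable(src_protos, tgt_feats, max_dist=0.5):
    """Assign each target feature to its nearest source category prototype.
    Features whose nearest prototype is close (<= max_dist) are selected
    for alignment; distant ones are left out, limiting negative transfer
    from unrelated categories."""
    # Pairwise Euclidean distances: (num_target, num_source_categories)
    d = np.linalg.norm(tgt_feats[:, None, :] - src_protos[None, :, :], axis=-1)
    nearest = d.argmin(axis=1)          # candidate source category per feature
    keep = d.min(axis=1) <= max_dist    # mask of features to actually align
    return nearest, keep
```

In the paper the selection emerges from a clustering objective in a transformed feature space; the nearest-prototype rule above only conveys the select-then-align intuition.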
ISBN (digital): 9781665487399
ISBN (print): 9781665487399
To train a change detector, bi-temporal images taken at different times in the same area are used. However, collecting labeled bi-temporal images is expensive and time-consuming. To solve this problem, various unsupervised change detection methods have been proposed, but they still require unlabeled bi-temporal images. In this paper, we propose an unsupervised change detection method based on image reconstruction loss, which uses only a single-temporal unlabeled image. The image reconstruction model is trained to reconstruct the original source image from a pair consisting of the source image and a photometrically transformed version of it. During inference, the model receives bi-temporal images as input and aims to reconstruct one of the inputs. The changed region between the bi-temporal images shows a high reconstruction loss. Our change detector demonstrated significant performance on various change detection benchmark datasets even though only a single-temporal source image was used. The code and trained models are available at https://***/cjf8899/CDRL
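The inference step can be sketched as follows, with a toy blending function standing in for the trained reconstruction network; `thresh` is a hypothetical cutoff, not a value from the paper:

```python
import numpy as np

def change_map(img_t1, img_t2, reconstruct, thresh=0.2):
    """At inference the trained model receives the bi-temporal pair and
    tries to reconstruct img_t1; pixels with high reconstruction error
    are flagged as changed."""
    recon = reconstruct(img_t1, img_t2)
    err = np.abs(recon - img_t1)
    return err > thresh  # boolean change mask

# Toy stand-in for the trained reconstruction network: it blends the two
# inputs, so it reproduces img_t1 exactly only where the images agree.
toy_reconstruct = lambda a, b: 0.5 * (a + b)
```

With the toy blender, pixels where the two images agree reconstruct perfectly (zero error), while disagreeing pixels incur half their difference as error, mimicking how the trained model fails on changed regions.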
ISBN (digital): 9781665487399
ISBN (print): 9781665487399
We study a practical setting of continual learning: fine-tuning on a pre-trained model continually. Previous work has found that, when training on new tasks, the features (penultimate layer representations) of previous data will change, called representational shift. Besides the shift of features, we reveal that the intermediate layers' representational shift (IRS) also matters since it disrupts batch normalization, which is another crucial cause of catastrophic forgetting. Motivated by this, we propose ConFiT, a fine-tuning method incorporating two components, cross-convolution batch normalization (Xconv BN) and hierarchical fine-tuning. Xconv BN maintains pre-convolution running means instead of post-convolution, and recovers post-convolution ones before testing, which corrects the inaccurate estimates of means under IRS. Hierarchical fine-tuning leverages a multi-stage strategy to fine-tune the pre-trained network, preventing massive changes in Conv layers and thus alleviating IRS. Experimental results on four datasets show that our method remarkably outperforms several state-of-the-art methods with lower storage overhead.
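The statistic that makes pre-convolution running means recoverable is linearity: for a linear layer (and likewise a convolution), the post-layer mean equals the layer applied to the pre-layer mean, so a mean stored before the layer can be converted after the weights change. A minimal numerical check of this identity, with made-up weights and data:

```python
import numpy as np

# Linear map standing in for a convolution. Because E[Wx] = W E[x], a
# running mean stored BEFORE the layer can be pushed through the
# (possibly fine-tuned) weights to recover the post-layer mean at test
# time. Weights and the data distribution here are made up.
W = np.array([[2.0, 0.0], [1.0, 1.0]])

x = np.random.default_rng(0).standard_normal((1000, 2)) + np.array([3.0, -1.0])
pre_mean = x.mean(axis=0)                  # stored pre-layer running mean
post_mean_direct = (x @ W.T).mean(axis=0)  # what post-layer BN would track
post_mean_recovered = W @ pre_mean         # recovered "before testing"
```

This is only the linear-algebra kernel of the idea; Xconv BN applies it to convolutional feature maps so that batch-norm statistics stay accurate under intermediate-layer representational shift.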
ISBN (digital): 9781665487399
ISBN (print): 9781665487399
Unstructured object matching is a less-explored and very challenging topic in the scientific literature. This includes matching scenarios where the context, appearance, and geometrical integrity of the objects to be matched change drastically from one image to another (e.g. a pair of pyjamas which in one image is folded and in the other is worn by a person), making it impossible to determine a transformation which aligns the matched regions. Traditional approaches like keypoint-based feature matching perform poorly on this use case due to the high complexity in terms of viewpoint, scene context variety, background variations or high degrees of freedom concerning structural configurations. In this paper, we propose a deep learning framework consisting of a twins-based matching approach leveraging a co-salient region segmentation task and a cosine-similarity based region descriptor pairing technique. The importance of our proposed framework is demonstrated on a novel use case consisting of image pairs with various objects used by children. Additionally, we evaluate on Human3.6M and Market-1501, two datasets with humans depicting various appearances and kinematic configurations captured under different backgrounds.
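The cosine-similarity pairing of region descriptors can be sketched as a mutual-nearest-neighbour match; this is a generic illustration, not the paper's matcher, and `min_sim` is a made-up threshold:

```python
import numpy as np

def pair_regions(desc_a, desc_b, min_sim=0.5):
    """Pair region descriptors across two images by cosine similarity,
    keeping only mutual best matches above `min_sim`."""
    a = desc_a / np.linalg.norm(desc_a, axis=1, keepdims=True)
    b = desc_b / np.linalg.norm(desc_b, axis=1, keepdims=True)
    sim = a @ b.T  # (num_regions_a, num_regions_b) cosine similarities
    pairs = []
    for i in range(sim.shape[0]):
        j = int(np.argmax(sim[i]))
        # Keep the pair only if i is also j's best match (mutual check),
        # which suppresses one-sided, ambiguous matches.
        if sim[i, j] >= min_sim and int(np.argmax(sim[:, j])) == i:
            pairs.append((i, j))
    return pairs
```

In the full framework the descriptors would come from co-salient region segmentation, so pairing operates on regions rather than keypoints and tolerates large structural deformation.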
ISBN (digital): 9781665487399
ISBN (print): 9781665487399
Lidar based simultaneous localization and mapping methods can be adapted for deployment on small autonomous vehicles operating in unmapped indoor environments. For this purpose, we propose a method which combines inertial data, low-drift lidar odometry, planar primitives, and loop closing in a graph-based structure. The accuracy of our method is experimentally evaluated, using a high-resolution lidar, and compared to the state-of-the-art methods LIO-SAM and Cartographer. We specifically address the lateral positioning accuracy when passing through narrow openings, where high accuracy is a prerequisite for safe operation of autonomous vehicles. The test cases include doorways, slightly wider reference passages, and a larger corridor environment. We observe a reduced lateral accuracy for all three methods when passing through the narrow openings compared to operation in larger spaces. Compared to state-of-the-art, our method shows better results in the narrow passages, and comparable results in the other environments with reasonably low usage of CPU and memory resources.
ISBN (digital): 9781665487399
ISBN (print): 9781665487399
Despite the great scientific effort to adequately capture the complex environments in which autonomous vehicles (AVs) operate, there are still use cases that even state-of-the-art (SoA) methods fail to handle. Specifically, in odometry problems, on the one hand, geometric solutions operate under certain assumptions that are often breached in AVs, and on the other hand, deep learning methods do not achieve high accuracy. To contribute to this effort, we present CarlaScenes, a large-scale simulation dataset captured using the CARLA simulator. The dataset is oriented to address the challenging odometry scenarios that cause current state-of-the-art odometry methods to deviate from their normal operation. Based on a case study of failures presented in the experiments, we distinguished 7 different sequences of data. CarlaScenes, besides providing consistent reference poses, includes data with semantic annotation at the instance level for both image and lidar. The full dataset is available at https://***/CarlaScenes/***.
ISBN (digital): 9781665469463
ISBN (print): 9781665487399
Forecasting of a representation is important for safe and effective autonomy. For this, panoptic segmentations have been studied as a compelling representation in recent work. However, recent state-of-the-art on panoptic segmentation forecasting suffers from two issues: first, individual object instances are treated independently of each other; second, individual object instance forecasts are merged in a heuristic manner. To address both issues, we study a new panoptic segmentation forecasting model that jointly forecasts all object instances in a scene using a transformer model based on 'difference attention.' It further refines the predictions by taking depth estimates into account. We evaluate the proposed model on the Cityscapes and AIODrive datasets. We find difference attention to be particularly suitable for forecasting because the difference of quantities like locations enables a model to explicitly reason about velocities and acceleration. Because of this, we attain state-of-the-art on panoptic segmentation forecasting metrics.
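A toy version of attention driven by pairwise differences rather than dot products, purely to illustrate why differences of quantities like positions let a model reason about velocities; it is not the paper's layer:

```python
import numpy as np

def difference_attention(x):
    """Toy 'difference attention': attention logits are a function of
    pairwise DIFFERENCES of the inputs, so relative displacements
    (velocities, when inputs are positions over time) enter the
    computation explicitly instead of being inferred from dot products."""
    diff = x[:, None, :] - x[None, :, :]      # (n, n, d) pairwise differences
    logits = -np.linalg.norm(diff, axis=-1)   # nearer tokens attend more
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)     # softmax over each row
    return w @ x                              # values are the inputs themselves
```

Chaining such differences (positions, then differences of differences) is what allows explicit reasoning about velocity and acceleration, the property the abstract attributes to difference attention.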
ISBN (digital): 9781665487399
ISBN (print): 9781665487399
In this work, we present a single-stage framework, named S2F2, for forecasting multiple human trajectories from raw video images by predicting future optical flows. S2F2 differs from the previous two-stage approaches in that it performs detection, Re-ID, and forecasting of multiple pedestrians at the same time. Unlike the prior approaches, the computational burden of S2F2 remains consistent even as the number of pedestrians grows. The experimental results demonstrate that S2F2 is able to outperform two conventional forecasting algorithms and a recent learning-based two-stage model [1], while maintaining its tracking performance on par with the contemporary MOT models.
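The flow-to-trajectory step can be sketched as sampling the predicted future flow at each pedestrian's current box center; the function and the flow-map layout below are assumptions for illustration, not S2F2's actual pipeline:

```python
import numpy as np

def forecast_centers(centers, flow):
    """Shift current pedestrian box centers (x, y) by the predicted
    optical flow sampled at each center. `flow` is assumed to be a dense
    (H, W, 2) future-flow map holding (dx, dy) per pixel."""
    out = []
    for x, y in centers:
        dx, dy = flow[int(y), int(x)]  # sample the dense flow at the center
        out.append((x + dx, y + dy))
    return out
```

Because the flow map is dense and shared, the per-person cost of this step is a single lookup, which matches the abstract's point that the computational burden stays consistent as the number of pedestrians grows.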