ISBN (Print): 9781665487399
With the rapid advances in generative adversarial networks (GANs), the visual quality of synthesised scenes keeps improving, including for complex urban scenes with applications to automated driving. In this work, we address a continual scene-generation setup in which GANs are trained on a stream of distinct domains; ideally, the learned models should eventually be able to generate new scenes in all seen domains. This setup reflects the real-life scenario where data are continuously acquired in different places at different times. In such a continual setup, we aim for learning with zero forgetting, i.e., with no degradation in synthesis quality over earlier domains due to catastrophic forgetting. To this end, we introduce a novel framework that not only (i) enables seamless knowledge transfer in continual training but also (ii) guarantees zero forgetting at a small overhead cost. While being more memory-efficient thanks to continual learning, our model achieves better synthesis quality than the brute-force solution that trains one full model per domain. In particular, under extremely low-data regimes, our approach outperforms the brute-force one by a large margin.
ISBN (Digital): 9781665487399
ISBN (Print): 9781665487399
Because it requires bounding-box labels for training, object detection is viewed unfavorably for crowd analysis, owing to the intensive labeling labor and its unsatisfactory performance under clutter and severe occlusion. Density-based regression, another feasible method, counts proficiently and needs only point-level labels for training, but cannot locate each person, and its time and space consumption is relatively high. In this paper, we propose a generic feature extraction framework, Adaptive Pyramid Score (APS), based on object detection and designed specifically for extracting quantitative and spatial-semantic features. Moreover, as an intuitive and feasible solution for crowd analysis, we propose the weakly-supervised Confidence-Threshold-Foresight Network (CTFNet) under our APS feature extraction framework, which needs only count-level labels for training and dramatically improves the performance of various methods. Our system realizes a triple enhancement of counting, localization, and detection, and is shown to be faster than advanced crowd analysis methods, easier to transplant to various object detection methods, and more robust in extreme scenes. Furthermore, the weakly-supervised paradigm greatly reduces the intensive labeling labor.
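The abstract does not give CTFNet's exact formulation, but the core idea of count-level weak supervision over detection confidences can be illustrated with a minimal sketch: given per-image detection scores and only image-level ground-truth counts, fit a single confidence threshold that makes the thresholded detection counts match the labels. The grid search and error metric here are illustrative assumptions, not the paper's method.

```python
import numpy as np

def count_at_threshold(scores, t):
    """Count detections whose confidence is at least t (one list per image)."""
    return sum(int(np.sum(np.asarray(s) >= t)) for s in scores)

def fit_threshold(scores_per_image, gt_counts, grid=None):
    """Pick the confidence threshold whose per-image detection counts
    best match the count-only labels (mean absolute count error)."""
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)
    best_t, best_err = None, float("inf")
    for t in grid:
        counts = [int(np.sum(np.asarray(s) >= t)) for s in scores_per_image]
        err = float(np.mean(np.abs(np.array(counts) - np.array(gt_counts))))
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err
```

Once fitted, the same threshold yields counts and, via the surviving boxes, coarse localization, which is the sense in which count-level labels can drive both tasks.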
ISBN (Digital): 9781665487399
ISBN (Print): 9781665487399
We study the few-shot learning (FSL) problem, where a model learns to recognize new objects from extremely few labeled training examples per category. Most previous FSL approaches resort to the meta-learning paradigm, where the model accumulates inductive bias by learning from many training tasks in order to solve new, unseen few-shot tasks. In contrast, we propose a simple semi-supervised FSL approach that exploits unlabeled data accompanying the few-shot task to improve FSL performance. More precisely, to train a classifier, we propose a Dependency Maximization loss based on the Hilbert-Schmidt norm of the cross-covariance operator, which maximizes the statistical dependency between the embedded features of the unlabeled data and their label predictions, together with the supervised loss over the support set. The obtained classifier is used to infer pseudo-labels for the unlabeled data. Furthermore, we propose an Instance Discriminant Analysis to evaluate the credibility of the pseudo-labeled examples and select the faithful ones into an augmented support set, which is used to retrain the classifier. We iterate this process until the pseudo-labels of the unlabeled data become stable. In extensive experiments on four widely used few-shot classification benchmarks, including mini-ImageNet, tiered-ImageNet, CUB, and CIFARFS, the proposed method outperforms previous state-of-the-art FSL methods.
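A dependency measure built on the Hilbert-Schmidt norm of the cross-covariance operator is, in its standard empirical form, the (biased) HSIC estimator. The sketch below computes it between features and predictions; the RBF kernels, the shared bandwidth, and the (n-1)^2 normalization are conventional choices for illustration, not necessarily the paper's exact loss.

```python
import numpy as np

def rbf_kernel(X, sigma=1.0):
    """Gaussian RBF kernel matrix for rows of X."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    return np.exp(-sq / (2 * sigma**2))

def hsic(X, Y, sigma=1.0):
    """Biased empirical HSIC between embedded features X and predictions Y:
    tr(K H L H) / (n-1)^2, with H the centering matrix."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    K = rbf_kernel(X, sigma)
    L = rbf_kernel(Y, sigma)
    return float(np.trace(K @ H @ L @ H)) / (n - 1) ** 2
```

Maximizing this quantity over the unlabeled batch encourages predictions that are statistically dependent on the features, which is the stated role of the Dependency Maximization term.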
ISBN (Digital): 9781665487399
ISBN (Print): 9781665487399
Systems developed for predicting both the action and the amount of time someone might take to perform that action need to be aware of the inherent uncertainty in what humans do. Here, we present a novel hybrid generative model for action anticipation that attempts to capture the uncertainty in human actions. Our model uses a multi-headed attention-based variational generative model for action prediction (MAVAP), and Gaussian log-likelihood maximization to predict the corresponding action's duration. During training, we optimise three losses: a variational loss, a negative log-likelihood loss, and a discriminative cross-entropy loss. We evaluate our model on benchmark datasets (i.e., Breakfast and 50Salads) for action forecasting tasks and demonstrate improvements over prior methods using both ground-truth observations and predicted features from an action segmentation network (i.e., MS-TCN++). We also show that factorizing the latent space across multiple Gaussian heads predicts more plausible future action sequences than a single Gaussian.
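The duration head is trained by Gaussian log-likelihood maximization, i.e., minimizing a Gaussian negative log-likelihood where the network predicts a mean and a log-variance per action. The sketch below shows that loss in its standard form; the log-variance parameterization is an assumption for numerical stability, not a detail taken from the paper.

```python
import numpy as np

def gaussian_nll(y, mu, log_var):
    """Mean Gaussian negative log-likelihood of durations y under
    N(mu, exp(log_var)); the model learns both mu and log_var."""
    var = np.exp(log_var)
    return float(np.mean(0.5 * (np.log(2 * np.pi) + log_var + (y - mu) ** 2 / var)))
```

Because the variance is predicted rather than fixed, the model can widen its uncertainty for actions whose durations vary a lot, which is exactly the uncertainty-awareness the abstract argues for.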
ISBN (Digital): 9781665487399
ISBN (Print): 9781665487399
Existing works in image retrieval often consider retrieving images with one or two query inputs, which does not generalize to multiple queries. In this work, we investigate a more challenging scenario: composing multiple multimodal queries in image retrieval. Given an arbitrary number of query images and/or texts, our goal is to retrieve target images containing the semantic concepts specified in the multiple multimodal queries. To learn an informative embedding that can flexibly encode the semantics of various queries, we propose a novel multimodal probabilistic composer (MPC). Specifically, we model input images and texts as probabilistic embeddings, which can be further composed by a probabilistic composition rule to facilitate image retrieval with multiple multimodal queries. We propose a new benchmark based on the MS-COCO dataset and evaluate our model on various setups that compose multiple image and/or text queries for multimodal image retrieval. Without bells and whistles, we show that our probabilistic model formulation significantly outperforms existing related methods on multimodal image retrieval while generalizing well to queries with different numbers of inputs in arbitrary visual and/or textual modalities.
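The abstract does not spell out the composition rule, but one common way to compose probabilistic embeddings is a product of diagonal Gaussians, where precisions add and the mean is precision-weighted; the sketch below shows that rule as an illustrative stand-in for MPC's composer, not as the paper's exact formulation.

```python
import numpy as np

def compose(mus, vars_):
    """Fuse several diagonal-Gaussian embeddings via a product rule:
    precisions (1/var) add, and the mean is the precision-weighted
    average of the input means."""
    prec = np.sum([1.0 / v for v in vars_], axis=0)
    var = 1.0 / prec
    mu = var * np.sum([m / v for m, v in zip(mus, vars_)], axis=0)
    return mu, var
```

A convenient property of such a rule is that it accepts any number of inputs, which mirrors the paper's goal of handling an arbitrary number of image and/or text queries.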
ISBN (Print): 9798350365474
This work reviews the results of the NTIRE 2024 Challenge on Shadow Removal. Building on last year's edition, the current challenge was organized in two tracks: one focused on high-fidelity reconstruction, with a separate ranking for high-performing perceptual-quality solutions. Track 1 (fidelity) had 214 registered participants, with 17 teams submitting in the final phase, while Track 2 (perceptual) registered 185 participants, resulting in 18 final-phase submissions. Both tracks were based on data from the WSRD dataset, simulating interactions between self-shadows and cast shadows, with a large variety of represented objects, textures, and materials. Improved image alignment enabled higher-fidelity reconstruction, with restored frames mostly indistinguishable from the reference images for top-performing solutions.
ISBN (Digital): 9781665487399
ISBN (Print): 9781665487399
Offline signature forgery detection has attracted many researchers in recent years. In real situations, signatures must be detected in signed documents and then verified by the forgery detection system. This pipeline poses several challenges. First, some signatures have low resolution and are difficult to detect. Second, the cropped signatures may contain irrelevant background context from the document, making them hard to verify. Third, some forged signatures are very similar to genuine ones, increasing the difficulty of verification. In addition, most existing datasets do not cover all the pipeline tasks, and publicly available Chinese signature datasets are rare for research purposes. In this paper, we construct a novel Chinese document offline signature forgery detection benchmark, namely ChiSig, which includes all pipeline tasks, i.e., signature detection, restoration, and verification. We also extensively compare different deep learning-based approaches on these three tasks. The results show that our proposed dataset can effectively support the construction of pipeline systems for Chinese document signature forgery detection.
ISBN (Digital): 9781665487399
ISBN (Print): 9781665487399
We tackle the task of Few-Shot Counting. Given an image containing multiple objects of a novel visual category and a few exemplar bounding boxes depicting the visual category of interest, we want to count all instances of the desired visual category in the image. A key challenge in building an accurate few-shot visual counter is the scarcity of annotated training data, due to the laborious effort needed to collect and annotate the data. To address this challenge, we propose Vicinal Counting Networks, which learn to augment the existing training data while learning to count. A Vicinal Counting Network consists of a generator and a counting network. The generator takes as input an image along with a random noise vector and generates an augmented version of the input image. The counting network learns to count the objects in the original and augmented images. The training signal for the generator comes from the counting loss of the counting network, and the generator aims to synthesize images that result in a small counting loss. Unlike GANs, which are trained in an adversarial setting, Vicinal Counting Networks are trained in a cooperative setting where the generator aims to help the counting network achieve accurate predictions on the synthesized images. We also show that our proposed data augmentation framework can be extended to other counting tasks, such as crowd counting.
In this paper, we propose a new, simple, and effective Self-supervised Spatio-temporal Transformers (SPARTAN) approach to Group Activity recognition (GAR) using unlabeled video data. Given a video, we create local and...
ISBN (Print): 9781665487399
Few-shot and cross-domain land use scene classification methods propose solutions for classifying unseen classes or unseen visual distributions, but are hardly applicable to real-world situations due to restrictive assumptions. Few-shot methods involve episodic training on restrictive training subsets with small feature extractors, while cross-domain methods are only applied to common classes. The underlying challenge remains open: can we accurately classify new scenes on new datasets? In this paper, we propose a new framework for few-shot, cross-domain classification. Our retrieval-inspired approach exploits the interrelations in both the training and testing data to output class labels using compact descriptors. Results show that our method can accurately produce land-use predictions on unseen datasets and unseen classes, going beyond the traditional few-shot or cross-domain formulation and allowing cross-dataset training.
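The abstract describes a retrieval-inspired classifier over compact descriptors without giving its exact form; a common baseline realization is k-nearest-neighbour retrieval under cosine similarity, sketched below. The L2 normalization, the similarity measure, and majority voting are illustrative assumptions.

```python
import numpy as np

def l2_normalize(X):
    """Row-wise L2 normalization, so dot products become cosine similarities."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def retrieve_labels(queries, gallery, gallery_labels, k=1):
    """Label each query descriptor with the majority label among its
    k most similar gallery descriptors (cosine similarity)."""
    q, g = l2_normalize(queries), l2_normalize(gallery)
    sims = q @ g.T
    idx = np.argsort(-sims, axis=1)[:, :k]
    labels = np.asarray(gallery_labels)[idx]
    return np.array([np.bincount(row).argmax() for row in labels])
```

Because classification reduces to comparing descriptors, the same gallery mechanism works whether the classes or the dataset were seen during training, which matches the cross-dataset motivation above.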