ISBN (print): 9781665448994
Motion segmentation is a technique to detect and localize class-agnostic motion in videos. This motion is assumed to be relative to a stationary background and usually originates from objects such as vehicles or humans. When the camera also moves, frame differencing approaches, which do not have to model the stationary background over minutes, hours, or even days, are more promising than background subtraction methods. In this paper, we propose a Deep Convolutional Neural Network (DCNN) for multi-modal motion segmentation: the current image contributes appearance information to distinguish between relevant and irrelevant motion, and frame differencing captures the temporal information, i.e., the scene's motion independent of the camera motion. We fuse this information to obtain an effective and efficient approach for robust motion segmentation. The effectiveness is demonstrated using the multi-spectral CDNet-2014 dataset, which we re-labeled for motion segmentation. We specifically show that we can detect tiny moving objects significantly better than methods based on optical flow.
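As a rough illustration of the multi-modal input described above (not the authors' actual DCNN), the sketch below stacks the current RGB frame with a grayscale frame-difference channel and feeds the result to a small segmentation network; the `SmallSegNet` name, layers, and channel counts are purely assumed.

```python
# Illustrative only: fuse appearance (current RGB frame) with a frame-difference
# channel and predict a per-pixel motion mask with a tiny CNN.
import torch
import torch.nn as nn

class SmallSegNet(nn.Module):                      # hypothetical network, not the paper's
    def __init__(self, in_channels=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(64, 1, 1)            # per-pixel motion logit

    def forward(self, x):
        return self.head(self.encoder(x))

def fuse_inputs(frame_t, frame_tm1):
    """Appearance (RGB) + temporal cue (absolute grayscale frame difference)."""
    diff = (frame_t.mean(dim=1, keepdim=True) -
            frame_tm1.mean(dim=1, keepdim=True)).abs()
    return torch.cat([frame_t, diff], dim=1)       # (B, 4, H, W)

frames = torch.rand(2, 2, 3, 240, 320)             # dummy clip: (B, T, C, H, W)
x = fuse_inputs(frames[:, 1], frames[:, 0])
mask_logits = SmallSegNet()(x)                     # (B, 1, H, W)
print(mask_logits.shape)
```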
We introduce À-la-carte Prompt Tuning (APT), a transformer-based scheme to tune prompts on distinct data so that they can be arbitrarily composed at inference time. The individual prompts can be trained in isolat...
ISBN (print): 9781665448994
Fine-tuning through knowledge transfer from a model pre-trained on a large-scale dataset is a widespread approach to effectively build models on small-scale datasets. In this work, we show that a recent adversarial attack designed for transfer learning via re-training the last linear layer can also successfully deceive models trained with transfer learning via end-to-end fine-tuning. This raises security concerns for many industrial applications. In contrast, models trained from random initialization without transfer are much more robust to such attacks, although these models often exhibit much lower accuracy. To this end, we propose noisy feature distillation, a new transfer learning method that trains a network from random initialization while achieving clean-data performance competitive with fine-tuning.
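The following is a minimal sketch of what a noisy-feature-distillation-style setup could look like: a randomly initialized student is trained with a classification loss plus a feature-matching term toward a frozen pre-trained teacher, with dropout acting as the noise on the distilled features. The backbone choice, dropout rate, and weight `beta` are assumptions, not the paper's exact recipe.

```python
# Illustrative sketch of noisy feature distillation (details assumed, not the paper's recipe).
import torch
import torch.nn as nn
import torchvision.models as models

teacher = models.resnet18(weights=None)      # would be a pre-trained model in practice
teacher.fc = nn.Identity()
teacher.eval()
for p in teacher.parameters():
    p.requires_grad_(False)

student = models.resnet18(weights=None)      # random initialization, no transferred weights
student.fc = nn.Identity()
classifier = nn.Linear(512, 10)              # assumed 10-class downstream task
feat_dropout = nn.Dropout(p=0.5)             # the "noise" injected into distilled features

def loss_fn(images, labels, beta=0.1):
    s_feat = student(images)
    logits = classifier(s_feat)
    with torch.no_grad():
        t_feat = teacher(images)
    ce = nn.functional.cross_entropy(logits, labels)               # supervised task loss
    distill = nn.functional.mse_loss(feat_dropout(s_feat), t_feat)  # noisy feature matching
    return ce + beta * distill

images, labels = torch.rand(4, 3, 224, 224), torch.randint(0, 10, (4,))
print(loss_fn(images, labels))
```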
ISBN (print): 9781665448994
In recent years, self-supervised learning has emerged as a promising candidate for unsupervised representation learning. In the visual domain, its applications are mostly studied in the context of images of natural scenes. However, its applicability is especially interesting in specific areas, such as remote sensing and medicine, where it is hard to obtain large amounts of labeled data. In this work, we conduct an extensive analysis of the applicability of self-supervised learning to remote sensing image classification. We analyze the influence of the number and domain of images used for self-supervised pre-training on the performance on downstream tasks. We show that, for the downstream task of remote sensing image classification, using self-supervised pre-training on remote sensing images can give better results than using supervised pre-training on images of natural scenes. We also show that self-supervised pre-training can easily be extended to multispectral images, producing even better results on our downstream tasks.
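A small sketch of the kind of adaptation that extending self-supervised pre-training to multispectral imagery implies: widening the backbone's first convolution so it accepts more than three bands. The band count (13, as in Sentinel-2) and the ResNet-50 backbone are assumptions, not the paper's setup.

```python
# Assumed adaptation for multispectral self-supervised pre-training: replace the
# first convolution so the backbone accepts 13 bands instead of 3 RGB channels.
import torch
import torch.nn as nn
import torchvision.models as models

def make_multispectral_backbone(num_bands=13):
    net = models.resnet50(weights=None)
    net.conv1 = nn.Conv2d(num_bands, 64, kernel_size=7, stride=2,
                          padding=3, bias=False)
    net.fc = nn.Identity()        # expose 2048-d features for a contrastive projection head
    return net

backbone = make_multispectral_backbone()
patches = torch.rand(2, 13, 224, 224)   # dummy multispectral patches
features = backbone(patches)            # (2, 2048), ready for a SimCLR/MoCo-style objective
print(features.shape)
```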
ISBN (digital): 9798350365474
ISBN (print): 9798350365481
Because an individual's identity can be verified automatically by comparing his or her face with the one present in a personal identification document, systems providing identification must be equipped with digital manipulation detectors. Among such manipulations, morphed facial images are a particular threat because they are visually indistinguishable from authentic facial photos and, by the nature of the attack, can carry characteristics of several possible subjects. Thus, morphing attack detection methods (MADs) must be integrated into automated face recognition. Following the recent advances in MADs, we investigate their effectiveness by proposing an integrated system simulator of real application contexts, moving from known to never-seen-before attacks.
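As a toy illustration of how a MAD module could be integrated with face verification in such a system, assuming both modules return scores in [0, 1] and using made-up thresholds: the document photo is accepted only if it matches the live probe and is not flagged as a morph.

```python
# Toy decision logic (thresholds and score conventions are assumptions).
def verify_identity(match_score: float, mad_score: float,
                    match_thr: float = 0.6, mad_thr: float = 0.5) -> str:
    """match_score: face similarity between document photo and live probe.
    mad_score: estimated probability that the document photo is a morph."""
    if mad_score >= mad_thr:
        return "reject: suspected morphed document image"
    if match_score < match_thr:
        return "reject: face does not match the document"
    return "accept"

print(verify_identity(match_score=0.82, mad_score=0.12))  # accept
print(verify_identity(match_score=0.82, mad_score=0.91))  # rejected by the MAD stage
```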
ISBN (print): 9781665448994
Traffic anomaly detection is an essential computer vision task and plays a critical role in video structure analysis and urban traffic analysis. In this paper, we propose a box-level tracking and refinement algorithm to identify anomalies in road scenes. We first link the detection results into candidate spatio-temporal tubes via greedy search. Then a box-level refinement scheme is introduced that employs auxiliary detection cues to promote the abnormal predictions; it consists of spatial fusion, a still-thing filter, temporal fusion, and feedforward optimization. The still-thing filter and feedforward optimization employ complementary detection concepts, which helps determine an accurate abnormal period. Experimental results show that our approach ranked second on the Traffic Anomaly Detection Track test set of the NVIDIA AI CITY 2021 CHALLENGE, with a 93.18% F1-score and a root mean square error of 3.1623. This demonstrates that the proposed approach contributes to fine-grained anomaly detection in actual traffic accident scenarios and promotes the development of intelligent transportation.
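A minimal sketch of the greedy linking step mentioned above: per-frame detection boxes are chained into spatio-temporal tubes by matching each detection to the tube whose last box has the highest IoU above a threshold. The box format, threshold value, and frame-adjacency rule are assumptions, and all refinement stages are omitted.

```python
# Illustrative greedy tube construction; boxes are (x1, y1, x2, y2), threshold assumed.
def iou(a, b):
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def link_tubes(detections_per_frame, iou_thr=0.3):
    tubes = []                                   # each tube is a list of (frame_idx, box)
    for t, boxes in enumerate(detections_per_frame):
        for box in boxes:
            best, best_iou = None, iou_thr
            for tube in tubes:
                last_t, last_box = tube[-1]
                overlap = iou(last_box, box)
                if last_t == t - 1 and overlap > best_iou:
                    best, best_iou = tube, overlap
            if best is not None:
                best.append((t, box))            # extend the best-matching tube
            else:
                tubes.append([(t, box)])         # start a new candidate tube
    return tubes

frames = [[(10, 10, 50, 50)], [(12, 11, 52, 51)], [(14, 12, 54, 52)]]
print(len(link_tubes(frames)))                   # -> 1: the three boxes form one tube
```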
ISBN (print): 9781665448994
Visual Question Answering (VQA) is a multi-modal task that aims to measure high-level visual understanding. Contemporary VQA models are restrictive in the sense that answers are obtained via classification over a limited vocabulary (in the case of open-ended VQA) or via classification over a set of multiple-choice answers. In this work, we present a completely generative formulation in which a multi-word answer is generated for a visual query. To take this a step further, we introduce a new task: ViQAR (Visual Question Answering and Reasoning), wherein a model must generate the complete answer along with a rationale that seeks to justify it. We propose an end-to-end architecture to solve this task and describe how to evaluate it. We show through qualitative and quantitative evaluation, as well as a human Turing test, that our model generates strong answers and rationales.
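Purely as an illustration of a generative answer-plus-rationale formulation (not the ViQAR architecture itself), the sketch below conditions a recurrent decoder on a fused image-question context vector, decodes the answer, and then decodes a rationale conditioned on the answer decoder's final state; all dimensions and the two-stage scheme are assumptions.

```python
# Highly simplified generative sketch; vocabulary size, dimensions, and decoding
# scheme are invented for illustration.
import torch
import torch.nn as nn

VOCAB, DIM = 1000, 256

class TwoStageDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.answer_rnn = nn.GRU(DIM, DIM, batch_first=True)
        self.rationale_rnn = nn.GRU(DIM, DIM, batch_first=True)
        self.out = nn.Linear(DIM, VOCAB)

    def forward(self, context, answer_tokens, rationale_tokens):
        # context: (B, DIM) fused image+question features (assumed given)
        h0 = context.unsqueeze(0)
        ans_hidden, h_ans = self.answer_rnn(self.embed(answer_tokens), h0)
        rat_hidden, _ = self.rationale_rnn(self.embed(rationale_tokens), h_ans)
        return self.out(ans_hidden), self.out(rat_hidden)

model = TwoStageDecoder()
ctx = torch.rand(2, DIM)
ans_logits, rat_logits = model(ctx, torch.randint(0, VOCAB, (2, 5)),
                               torch.randint(0, VOCAB, (2, 12)))
print(ans_logits.shape, rat_logits.shape)
```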
ISBN (print): 9781665448994
Many important yet not fully resolved problems in computational photography and image enhancement, e.g., generating well-lit images from their low-light counterparts or producing RGB images from their RAW camera inputs, share a common nature: discovering a color mapping from input pixels to output pixels based on both global information and local details. We propose a novel deep neural network architecture to learn the RAW-to-RGB mapping based on this common nature. The architecture consists of a global and a local sub-network, where the global sub-network focuses on determining illumination and color mapping, while the local sub-network recovers image details. The result of the global network serves as guidance to the local network to form the final RGB images. Our method outperforms the state of the art on various image enhancement tasks with a significantly smaller number of network features.
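A schematic sketch of the global-plus-local idea with invented layer sizes (not the authors' network): the global branch pools the whole image into a guidance vector for illumination and color, which then modulates a local branch that restores per-pixel detail.

```python
# Schematic only: global guidance vector modulates local per-pixel features.
import torch
import torch.nn as nn

class GlobalLocalNet(nn.Module):
    def __init__(self, in_ch=4, out_ch=3, guide_dim=32):
        super().__init__()
        self.global_branch = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, guide_dim),
        )
        self.local_branch = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, guide_dim, 3, padding=1),
        )
        self.head = nn.Conv2d(guide_dim, out_ch, 1)

    def forward(self, raw):
        guide = self.global_branch(raw)                 # (B, guide_dim) global color/illumination
        local = self.local_branch(raw)                  # (B, guide_dim, H, W) local details
        fused = local * guide[:, :, None, None]         # global guidance scales local features
        return torch.sigmoid(self.head(fused))          # RGB in [0, 1]

rgb = GlobalLocalNet()(torch.rand(1, 4, 128, 128))      # 4-channel packed RAW assumed
print(rgb.shape)
```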
ISBN (print): 9781665448994
We introduce a novel method for collecting table tennis video data and performing stroke detection and classification. Using the proposed setup, a diverse dataset containing video data of 11 basic strokes obtained from 14 professional table tennis players, totaling 22,111 videos, has been collected. The temporal convolutional neural network model developed using 2D pose estimation performs multiclass classification of these 11 table tennis strokes with a validation accuracy of 99.37%. Moreover, the neural network generalizes well to the data of a player excluded from the training and validation dataset, classifying these unseen strokes with an overall best accuracy of 98.72%. Various machine learning and deep learning based model architectures have been trained for stroke recognition, and their performance has been compared and benchmarked. Applications of the model, such as performance monitoring and stroke comparison across players, are also discussed. We thereby contribute to the development of a computer vision based sports analytics system for table tennis that focuses on a previously unexploited aspect of the sport, i.e., a player's strokes, which is highly insightful for performance improvement.
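As a hedged sketch of a temporal convolutional classifier over 2D pose sequences (not the paper's exact model), the snippet below applies 1D convolutions over time to flattened keypoint coordinates and outputs one of 11 stroke classes; the keypoint count, sequence length, and layer widths are assumptions.

```python
# Illustrative temporal CNN over 2D pose keypoints; sizes assumed.
import torch
import torch.nn as nn

NUM_KEYPOINTS, SEQ_LEN, NUM_STROKES = 17, 64, 11

model = nn.Sequential(
    nn.Conv1d(NUM_KEYPOINTS * 2, 64, kernel_size=5, padding=2), nn.ReLU(),
    nn.Conv1d(64, 128, kernel_size=5, padding=2), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(128, NUM_STROKES),
)

poses = torch.rand(8, NUM_KEYPOINTS * 2, SEQ_LEN)   # (batch, keypoint x/y coords, time)
logits = model(poses)                               # (8, 11) stroke class scores
print(logits.shape)
```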
ISBN (print): 9781665448994
Self-supervised learning solves pretext prediction tasks that do not require annotations in order to learn feature representations. For vision tasks, pretext tasks such as predicting rotation or solving jigsaw puzzles are created solely from the input data. Yet, predicting this known information helps in learning representations useful for downstream tasks. However, recent works have shown that wider and deeper models benefit more from self-supervised learning than smaller models do. To address the issue of self-supervised pre-training of smaller models, we propose Distill-on-the-Go (DoGo), a self-supervised learning paradigm that uses single-stage online knowledge distillation to improve the representation quality of smaller models. We employ a deep mutual learning strategy in which two models collaboratively learn from each other to improve one another. Specifically, each model is trained using self-supervised learning along with a distillation loss that aligns each model's softmax probabilities over similarity scores with those of its peer. We conduct extensive experiments on multiple benchmark datasets, learning objectives, and architectures to demonstrate the potential of our proposed method. Our results show significant performance gains in the presence of noisy and limited labels, as well as in generalization to out-of-distribution data.
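A simplified sketch of the peer-alignment term: each model's softmax over pairwise similarity scores within the batch is aligned with the peer's via a symmetric KL divergence, on top of each model's own self-supervised loss. The temperature, the in-batch similarity definition, and the epsilon stabilizer are simplifying assumptions.

```python
# Simplified peer-distillation term (temperature and similarity definition assumed).
import torch
import torch.nn.functional as F

def similarity_probs(embeddings, temperature=0.1):
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t() / temperature          # pairwise cosine similarities within the batch
    sim.fill_diagonal_(-1e9)               # effectively ignore self-similarity
    return F.softmax(sim, dim=1)

def peer_distillation_loss(emb_a, emb_b, eps=1e-8):
    p_a, p_b = similarity_probs(emb_a), similarity_probs(emb_b)
    # symmetric KL between the two models' similarity distributions; eps avoids log(0)
    kl_ab = F.kl_div((p_a + eps).log(), p_b, reduction='batchmean')
    kl_ba = F.kl_div((p_b + eps).log(), p_a, reduction='batchmean')
    return kl_ab + kl_ba

emb_big, emb_small = torch.randn(16, 128), torch.randn(16, 64)   # peers may differ in width
print(peer_distillation_loss(emb_big, emb_small))
```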