ISBN (digital): 9781665487399
ISBN (print): 9781665487399
Anomaly detection in videos is an important computer vision problem with various applications, including automated video surveillance. Although adversarial attacks on image understanding models have been heavily investigated, there is little work on adversarial machine learning targeting video understanding models and no previous work focusing on video anomaly detection. To this end, we investigate an adversarial machine learning attack against video anomaly detection systems that can be implemented via an easy-to-perform cyber-attack. Since surveillance cameras are usually connected to the server running the anomaly detection model through a wireless network, they are prone to cyber-attacks targeting the wireless connection. We demonstrate how the Wi-Fi deauthentication attack, a notoriously easy-to-perform and effective denial-of-service (DoS) attack, can be utilized to generate adversarial data for video anomaly detection systems. Specifically, we apply several effects caused by the Wi-Fi deauthentication attack on video quality (e.g., slow down, freeze, fast forward, low resolution) to popular benchmark datasets for video anomaly detection. Our experiments with several state-of-the-art anomaly detection models show that attackers can significantly undermine the reliability of video anomaly detection systems by causing frequent false alarms and hiding physical anomalies from the surveillance system.
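As a rough illustration of how such degradations could be synthesized offline, the sketch below applies the four named effects to a frame sequence with NumPy. All function names, the frame representation, and the parameter choices are hypothetical assumptions; this is not the authors' pipeline.

```python
import numpy as np

def apply_freeze(frames, start, duration):
    """Simulate a connection drop: the last received frame repeats."""
    out = list(frames)
    for i in range(start, min(start + duration, len(out))):
        out[i] = out[start]
    return out

def apply_slowdown(frames, factor=2):
    """Simulate reduced throughput: each frame is duplicated `factor` times."""
    return [f for f in frames for _ in range(factor)]

def apply_fast_forward(frames, factor=2):
    """Simulate frame drops followed by catch-up: keep every `factor`-th frame."""
    return frames[::factor]

def apply_low_resolution(frames, scale=4):
    """Simulate bitrate adaptation: naive downsample, then nearest-neighbor upsample."""
    out = []
    for f in frames:
        small = f[::scale, ::scale]
        out.append(np.repeat(np.repeat(small, scale, axis=0), scale, axis=1))
    return out

# Example: a synthetic clip of 30 grayscale 64x64 frames.
clip = [np.random.rand(64, 64).astype(np.float32) for _ in range(30)]
perturbed = apply_freeze(apply_low_resolution(clip), start=10, duration=8)
```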
ISBN (print): 9781665487399
Semantic segmentation is a challenging task since it requires substantially more low-level spatial information of the image than other computer vision problems. The accuracy of pixel-level classification can be affected by many factors, such as imaging limitations and the ambiguity of object boundaries in an image. Conventional methods exploit three-channel RGB images captured in the visible spectrum with deep neural networks (DNNs). Thermal images can contribute significantly to segmentation, since thermal imaging cameras capture details regardless of weather and illumination conditions. Using the infrared spectrum in semantic segmentation has many real-world use cases, such as autonomous driving, medical imaging, agriculture, and the defense industry. Due to this wide range of use cases, designing accurate semantic segmentation algorithms with the help of the infrared spectrum is an important challenge. One approach is to use both visible and infrared spectrum images as inputs. These methods can achieve higher accuracy due to enriched input information, at the cost of extra effort for the alignment and processing of multiple inputs. Another approach is to use only thermal images, which lowers hardware cost for smaller use cases. Even though there are multiple surveys on semantic segmentation methods, the literature lacks a comprehensive survey centered explicitly on semantic segmentation using the infrared spectrum. This work aims to fill this gap by presenting the algorithms in the literature and categorizing them by their input images.
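As a sketch of the first (multi-spectral) approach, the toy model below performs early fusion by stacking pixel-aligned RGB and thermal channels into a single four-channel input; the architecture, class count, and all names are illustrative assumptions, not any specific method from the survey.

```python
import torch
import torch.nn as nn

class EarlyFusionSegNet(nn.Module):
    """Toy encoder-decoder accepting RGB (3 ch) + thermal (1 ch) stacked
    into a 4-channel input. A thermal-only variant would use in_ch=1."""
    def __init__(self, in_ch=4, num_classes=9):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(32, num_classes, 2, stride=2),
        )

    def forward(self, rgb, thermal):
        x = torch.cat([rgb, thermal], dim=1)  # requires pixel-aligned inputs
        return self.decoder(self.encoder(x))

rgb = torch.randn(2, 3, 128, 128)
thermal = torch.randn(2, 1, 128, 128)
logits = EarlyFusionSegNet()(rgb, thermal)  # (2, 9, 128, 128)
```

The `torch.cat` line is where the "extra effort" mentioned above concentrates in practice: the two modalities must be registered pixel-for-pixel before they can be fused this way.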
ISBN (digital): 9781665487399
ISBN (print): 9781665487399
A common paradigm in deep learning applications for computer vision is self-supervised pretraining followed by supervised fine-tuning on a target task. In the self-supervision step, a model is trained in a supervised fashion, but the source of supervision needs to be implicitly defined by the data. Image-caption alignment is often used as such a source of implicit supervision in multimodal pretraining, and grounding (i.e., matching word tokens with visual tokens) is one way to exploit it. We introduce a strategy to take advantage of an underexplored structure in image-caption datasets: the relationship between captions matched with different images but mentioning the same objects. Given an image-caption pair, we find an additional caption that mentions one of the objects the first caption mentions, and we impose a sparse grounding between the image and the second caption so that only a few word tokens are grounded in the image. Our goal is to learn a better feature representation for the objects mentioned by both captions, encouraging grounding between the additional caption and the image to focus on the common objects only. We report superior grounding performance when comparing our approach with a previously published pretraining strategy, and we show the benefit of our proposed double-caption grounding on two downstream detection tasks: supervised detection and open-vocabulary detection.
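The sketch below illustrates one plausible form of sparse grounding under the stated idea: only the top-k best-matching word tokens of the additional caption are encouraged to ground in the image. The loss, the top-k choice, and all dimensions are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def grounding_scores(word_feats, patch_feats):
    """Cosine similarity between each word token and each visual token.
    word_feats: (W, D), patch_feats: (P, D) -> (W, P)."""
    w = F.normalize(word_feats, dim=-1)
    p = F.normalize(patch_feats, dim=-1)
    return w @ p.t()

def sparse_grounding_loss(word_feats, patch_feats, k=3):
    """Encourage only the k words of the extra caption that best match
    some image region to ground strongly; other words are unconstrained."""
    sims = grounding_scores(word_feats, patch_feats)  # (W, P)
    best_per_word = sims.max(dim=1).values            # (W,)
    topk = best_per_word.topk(min(k, best_per_word.numel())).values
    return -topk.mean()  # maximize grounding of the k best-matching words

words = torch.randn(12, 256)    # tokens of the second caption
patches = torch.randn(49, 256)  # 7x7 grid of visual tokens
loss = sparse_grounding_loss(words, patches)
```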
ISBN (digital): 9781665487399
ISBN (print): 9781665487399
Self-supervised learning has proved to be a powerful approach to learning image representations without the need for large labeled datasets. For underwater robotics, it is of great interest to design computer vision algorithms to improve perception capabilities such as sonar image classification. Due to the confidential nature of sonar imaging and the difficulty of interpreting sonar images, it is challenging to create large public labeled sonar datasets to train supervised learning algorithms. In this work, we investigate the potential of three self-supervised learning methods (RotNet, Denoising Autoencoders, and Jigsaw) to learn high-quality sonar image representations without the need for human labels. We present pre-training and transfer learning results on real-life sonar image datasets. Our results indicate that self-supervised pre-training yields classification performance comparable to supervised pre-training in a few-shot transfer learning setup across all three methods. Code and self-supervised pre-trained models are available at agrija9/ssl-sonar-images.
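Of the three pretext tasks, RotNet is the simplest to sketch: each image is rotated by 0/90/180/270 degrees and a network is trained to predict the rotation, so labels come for free. The toy backbone and dimensions below are placeholders, not the paper's setup.

```python
import torch
import torch.nn as nn

def make_rotation_batch(images):
    """images: (B, C, H, W) -> rotated batch (4B, C, H, W) and labels (4B,)."""
    rots = [torch.rot90(images, k, dims=(2, 3)) for k in range(4)]
    x = torch.cat(rots, dim=0)
    y = torch.arange(4).repeat_interleave(images.size(0))  # matches cat order
    return x, y

backbone = nn.Sequential(
    nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
head = nn.Linear(32, 4)  # 4 rotation classes

sonar = torch.randn(8, 1, 96, 96)  # stand-in for single-channel sonar images
x, y = make_rotation_batch(sonar)
loss = nn.CrossEntropyLoss()(head(backbone(x)), y)
```

After pre-training, the rotation head is discarded and the backbone is fine-tuned on the downstream (few-shot) classification task.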
ISBN (digital): 9781665487399
ISBN (print): 9781665487399
Anomaly detection is a well-established research area that seeks to identify samples outside of a predetermined distribution. An anomaly detection pipeline comprises two main stages: (1) feature extraction and (2) normality score assignment. Recent papers have used pre-trained networks for feature extraction, achieving state-of-the-art results. However, the use of pre-trained networks does not fully utilize the normal samples that are available at train time. This paper suggests taking advantage of this information by using teacher-student training. In our setting, a pre-trained teacher network is used to train a student network on the normal training samples. Since the student network is trained only on normal samples, it is expected to deviate from the teacher network in abnormal cases. This difference can serve as a complementary representation to the pre-trained feature vector. Our method, Transformaly, exploits a pre-trained Vision Transformer (ViT) to extract both feature vectors: the pre-trained (agnostic) features and the teacher-student (fine-tuned) features. We report state-of-the-art AUROC results in both the common unimodal setting, where one class is considered normal and the rest are considered abnormal, and the multimodal setting, where all classes but one are considered normal and just one class is considered abnormal.
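A minimal sketch of the teacher-student component follows: a student is regressed onto a frozen teacher using normal samples only, and the teacher-student discrepancy serves as the anomaly score. The two small MLPs stand in for ViT blocks, and the 768-dimensional features are an assumption.

```python
import torch
import torch.nn as nn

teacher = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 768))
student = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 768))
for p in teacher.parameters():
    p.requires_grad = False  # teacher stays frozen

def train_step(normal_feats, opt):
    """normal_feats: (B, 768) features of normal samples, e.g. from a frozen ViT."""
    loss = ((student(normal_feats) - teacher(normal_feats)) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

def anomaly_score(feats):
    """Large teacher-student discrepancy suggests an abnormal sample."""
    with torch.no_grad():
        return ((student(feats) - teacher(feats)) ** 2).mean(dim=1)  # (B,)

opt = torch.optim.Adam(student.parameters(), lr=1e-4)
train_step(torch.randn(32, 768), opt)
scores = anomaly_score(torch.randn(4, 768))
```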
ISBN (print): 9781665487399
In many real-world scenarios, data to train machine learning models become available over time. However, neural network models struggle to continually learn new concepts without forgetting what has been learnt in the past. This phenomenon is known as catastrophic forgetting, and it is often difficult to prevent due to practical constraints, such as the amount of data that can be stored or the limited computational resources that can be used. Moreover, training large neural networks, such as Transformers, from scratch is very costly and requires a vast amount of training data, which might not be available in the application domain of interest. A recent trend indicates that dynamic architectures based on an expansion of the parameters can reduce catastrophic forgetting efficiently in continual learning, but these architectures require complex tuning to balance the growing number of parameters and barely share any information across tasks. As a result, they struggle to scale to a large number of tasks without significant overhead. In this paper, we validate in the computer vision domain a recent solution called Adaptive Distillation of Adapters (ADA), which was developed to perform continual learning using pre-trained Transformers and Adapters on text classification tasks. We empirically demonstrate on different classification tasks that this method maintains good predictive performance without retraining the model or increasing the number of model parameters over time. Besides, it is significantly faster at inference time compared to state-of-the-art methods.
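The sketch below shows the basic building block this line of work relies on: bottleneck adapters trained on top of a frozen backbone, one per task, so the backbone itself is never retrained. ADA's distillation and consolidation of adapters is not shown; the backbone, dimensions, and names are placeholders.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter inserted into a frozen backbone: only these
    few parameters are trained for a new task."""
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))  # residual connection

backbone = nn.Sequential(nn.Linear(768, 768), nn.ReLU())  # stand-in, frozen
for p in backbone.parameters():
    p.requires_grad = False

adapters = nn.ModuleDict()  # one adapter per task, added over time

def add_task(name):
    adapters[name] = Adapter()

def forward_task(x, name):
    return adapters[name](backbone(x))

add_task("task_0")
out = forward_task(torch.randn(4, 768), "task_0")
```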
ISBN (digital): 9781665487399
ISBN (print): 9781665487399
We propose a remote method to estimate continuous blood pressure based on spatial information of a pulse wave at a single point in time. By setting regions of interest to cover a face in a mutually exclusive and collectively exhaustive manner, RGB facial video is converted into a spatial pulse wave signal. The spatial pulse wave signal is converted into spatial signals of the contours of each segmented pulse beat and the relationships between segmented pulse beats. The spatial signal is represented as a time-continuous value based on a representation of the pulse contour in a time axis and a phase axis, with an interpolation along the time axis. The relationship between the spatial signals and blood pressure is modeled by a convolutional neural network. A dataset consisting of continuous blood pressure measurements and facial RGB videos of ten healthy volunteers was built to demonstrate the effectiveness of the proposed method. A comparison with conventional methods shows lower error for the proposed method. Against the ground truth in mean blood pressure, the proposed method achieves adequate estimation performance in both the correlation coefficient (0.85) and the mean absolute error (5.4 mmHg).
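As a rough sketch of the first step, the snippet below tiles a face crop into a grid of mutually exclusive, collectively exhaustive ROIs and averages the green channel per ROI per frame to form a spatial pulse wave signal. The grid size, channel choice, and function name are illustrative assumptions, not the paper's exact ROI definition.

```python
import numpy as np

def spatial_pulse_signal(video, grid=(4, 4)):
    """video: (T, H, W, 3) RGB face crop. Tile into a grid of ROIs and
    average the green channel per ROI per frame, yielding a
    (T, rows*cols) spatial pulse wave signal."""
    T, H, W, _ = video.shape
    rows, cols = grid
    h, w = H // rows, W // cols
    sig = np.empty((T, rows * cols), dtype=np.float32)
    for r in range(rows):
        for c in range(cols):
            roi = video[:, r*h:(r+1)*h, c*w:(c+1)*w, 1]  # green channel
            sig[:, r*cols + c] = roi.mean(axis=(1, 2))
    return sig

video = np.random.rand(300, 128, 128, 3)  # 10 s at 30 fps, face crop
signal = spatial_pulse_signal(video)       # (300, 16)
```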
ISBN (digital): 9781665487399
ISBN (print): 9781665487399
Modern surveillance systems have become increasingly dependent on artificial intelligence to provide actionable information for real-time decision making. A critical question relates to how these systems handle difficult ethical dilemmas, such as the re-identification of similar-looking individuals. Potential misidentification of individuals can have severe negative consequences, as evidenced by recent headlines about individuals who were wrongly targeted for crimes they did not commit based on false matches. A computer vision-based saliency algorithm is proposed to help identify pixel-level differences in pairs of images containing visually similar individuals, which we term "doppelgängers." The computed saliency maps can alert human users to the presence of doppelgängers and provide important visual evidence to reduce the potential for false matches in these high-stakes situations. We show both qualitative and quantitative saliency results on doppelgängers found in a video-based person re-identification dataset (MARS) using three different state-of-the-art models. Our results suggest that this novel use of visual saliency can improve overall outcomes by helping human users in the person re-identification setting, while assuring the ethical and trusted operation of surveillance systems.
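One generic way to obtain such pixel-level maps is gradient-based saliency on the pair similarity, sketched below; this is a simple stand-in, not the paper's proposed algorithm, and the embedding model is a placeholder.

```python
import torch
import torch.nn as nn

embed = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))  # toy re-ID model

def pair_saliency(img_a, img_b):
    """Gradient of the pair similarity w.r.t. img_a: highlights the
    pixels that most influence the match score."""
    img_a = img_a.clone().requires_grad_(True)
    fa, fb = embed(img_a), embed(img_b)
    sim = torch.cosine_similarity(fa, fb, dim=1).sum()
    sim.backward()
    return img_a.grad.abs().max(dim=1).values  # (B, H, W) saliency map

a = torch.randn(1, 3, 64, 64)
b = torch.randn(1, 3, 64, 64)
saliency = pair_saliency(a, b)
```

Regions with high saliency are candidates for the discriminative details a human reviewer should inspect before confirming or rejecting a match.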
ISBN (digital): 9781665487399
ISBN (print): 9781665487399
Although recent deep neural network algorithms have shown tremendous success in several computer vision tasks, their vulnerability to minute adversarial perturbations has raised serious concern. In the early days of crafting these adversarial examples, artificial noise was optimized through the network and added to images to decrease the confidence of classifiers in the true class. However, recent efforts have showcased the presence of natural adversarial examples, which can also be used effectively to fool deep neural networks with high confidence. In this paper, for the first time, we raise the question of whether there is any robustness connection between artificial and natural adversarial examples. The possible robustness connection between natural and artificial adversarial examples is studied by asking whether an adversarial example detector trained on artificial examples can detect natural adversarial examples. We have analyzed several deep neural networks for the possible detection of artificial and natural adversarial examples in seen and unseen settings to establish a robust connection. The extensive experimental results reveal several interesting insights for defending deep classifiers, whether they are vulnerable to natural or artificially perturbed examples. We believe these findings can pave the way for the development of unified resiliency, because defense against one attack is not sufficient for real-world use cases.
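The detector-transfer question can be sketched as follows: train a binary detector on clean versus artificial adversarial features, then measure how often it flags natural adversarial features (the unseen setting). The synthetic feature vectors below are placeholders for real network activations.

```python
import torch
import torch.nn as nn

detector = nn.Sequential(nn.Linear(512, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(detector.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

clean = torch.randn(256, 512)                 # placeholder feature vectors
artificial_adv = torch.randn(256, 512) + 0.5  # seen attack (train)
natural_adv = torch.randn(128, 512) + 0.5     # unseen attack (test)

x = torch.cat([clean, artificial_adv])
y = torch.cat([torch.zeros(256, 1), torch.ones(256, 1)])
for _ in range(100):
    opt.zero_grad()
    loss = bce(detector(x), y)
    loss.backward()
    opt.step()

# Unseen setting: fraction of natural adversarials flagged by the detector.
with torch.no_grad():
    detection_rate = (detector(natural_adv) > 0).float().mean()
```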
ISBN (digital): 9781665487399
ISBN (print): 9781665487399
Automatic Facial Expression Recognition (FER) has attracted increasing attention in the last 20 years, since facial expressions play a central role in human communication. Most FER methodologies utilize Deep Neural Networks (DNNs), which are powerful tools when it comes to data analysis. However, despite their power, these networks are prone to overfitting, as they often tend to memorize the training data. Moreover, there are currently few large in-the-wild (i.e., unconstrained-environment) databases for FER. To alleviate this issue, a number of data augmentation techniques have been proposed. Data augmentation is a way to increase the diversity of available data by applying constrained transformations to the original data. One such technique, which has positively contributed to various classification tasks, is Mixup, in which a DNN is trained on convex combinations of pairs of examples and their corresponding labels. In this paper, we examine the effectiveness of Mixup for in-the-wild FER, where data have large variations in head poses, illumination conditions, backgrounds, and contexts. We then propose a new data augmentation strategy based on Mixup, called MixAugment, in which the network is trained concurrently on a combination of virtual examples and real examples; all of these examples contribute to the overall loss function. We conduct an extensive experimental study that demonstrates the effectiveness of MixAugment over Mixup and various state-of-the-art methods. We further investigate the combination of dropout with Mixup and MixAugment, as well as the combination of other data augmentation techniques with MixAugment.
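The sketch below shows standard Mixup and a MixAugment-style training step in which, per the description above, both the virtual (mixed) and the real examples contribute to the loss. The model, dimensions, and equal loss weighting are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

def mixup(x, y, alpha=0.2):
    """Mixup: convex combinations of example pairs and their labels.
    x: (B, ...) inputs, y: (B, C) one-hot labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    idx = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[idx], lam * y + (1 - lam) * y[idx]

def mixaugment_loss(model, x, y):
    """MixAugment-style objective: virtual and real examples both
    contribute to the overall loss (equal weighting assumed here)."""
    x_mix, y_mix = mixup(x, y)
    ce = nn.CrossEntropyLoss()           # accepts soft (probability) targets
    virtual = ce(model(x_mix), y_mix)    # loss on virtual examples
    real = ce(model(x), y)               # loss on real examples
    return virtual + real

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 48 * 48, 7))  # 7 expressions
x = torch.randn(16, 3, 48, 48)
y = torch.eye(7)[torch.randint(0, 7, (16,))]  # one-hot labels
loss = mixaugment_loss(model, x, y)
```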