Details
ISBN: (Print) 9781665448994
Facial micro-expressions are brief, rapid, spontaneous movements of the facial muscles that express an individual's genuine emotions. Because of their short duration and subtlety, detecting and classifying these micro-expressions is difficult for both humans and machines. In this paper, a novel approach is proposed that exploits relationships between landmark points and the optical-flow patch for each landmark point. It consists of a two-stream graph attention convolutional network that extracts the relationships between the landmark points and the local texture captured by an optical-flow patch. A graph structure is built over a triplet of frames to extract temporal information. One stream processes node (landmark) locations, and the other processes patches of optical-flow information. These two streams (the node-location stream and the optical-flow stream) are fused for classification. Results are reported on two publicly available datasets, CASME II and SAMM, for three-class and five-class micro-expression recognition. The proposed approach outperforms state-of-the-art methods in both the 3- and 5-category settings.
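The fusion step described above can be sketched in miniature. This is a hedged toy illustration of late fusion of two feature streams, not the paper's graph attention network: both stream functions below are hypothetical stand-ins.

```python
# Toy sketch of two-stream late fusion (assumed simplification of the
# paper's node-location + optical-flow design; the real streams are
# graph attention convolutional networks).

def node_location_stream(landmarks):
    # Toy "feature": flatten (x, y) landmark coordinates.
    return [c for point in landmarks for c in point]

def optical_flow_stream(flow_patches):
    # Toy "feature": mean absolute flow per patch.
    return [sum(abs(v) for v in patch) / len(patch) for patch in flow_patches]

def fuse_and_classify(landmarks, flow_patches, weights):
    # Late fusion by concatenation, then one linear score per class.
    fused = node_location_stream(landmarks) + optical_flow_stream(flow_patches)
    scores = [sum(w * f for w, f in zip(row, fused)) for row in weights]
    return scores.index(max(scores))  # predicted class id

landmarks = [(0.1, 0.2), (0.3, 0.1)]           # 2 landmark points -> 4 features
patches = [[0.5, -0.2, 0.1], [0.0, 0.3, 0.4]]  # 2 flow patches -> 2 features
weights = [[1, 0, 0, 0, 0, 0],                 # class 0 weighs locations
           [0, 0, 0, 0, 1, 1]]                 # class 1 weighs flow
pred = fuse_and_classify(landmarks, patches, weights)  # -> 1 on this toy input
```

The design choice illustrated is that each stream keeps its own representation until the final fused classifier, so location and texture cues are learned independently.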
Details
ISBN: (Print) 9781665448994
Recent years have seen a surge of interest in finding associations between faces and voices in cross-modal biometric applications, alongside speaker recognition. Inspired by this, we introduce the challenging task of establishing associations between faces and voices across multiple languages spoken by the same set of persons. The aim of this paper is to answer two closely related questions: "Is face-voice association language independent?" and "Can a speaker be recognized irrespective of the spoken language?". These two questions are important for understanding the effectiveness of, and for boosting the development of, multilingual biometric systems. To answer them, we collected a Multilingual Audio-Visual dataset containing human speech clips of 154 identities, with annotations for 3 languages, extracted from various videos uploaded online. Extensive experiments on the two splits of the proposed dataset investigate and answer these novel research questions, clearly demonstrating the relevance of the multilingual problem.
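One way to frame the language-independence question is as cross-modal retrieval: a face embedding should retrieve the same speaker's voice embedding regardless of the language spoken. The sketch below is a hedged illustration with made-up toy embeddings; it is not the paper's model or protocol.

```python
import math

# Toy sketch: face-voice matching by cosine similarity. If the retrieved
# speaker is stable across languages, the association is language
# independent for this (fabricated) example.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def match_face_to_voice(face_emb, voice_embs):
    sims = [cosine(face_emb, v) for v in voice_embs]
    return sims.index(max(sims))  # index of the best-matching speaker

face = [1.0, 0.0, 0.5]
voices_lang_a = [[0.9, 0.1, 0.4], [0.0, 1.0, 0.0]]  # speakers A, B in language 1
voices_lang_b = [[1.0, 0.2, 0.6], [0.1, 0.9, 0.1]]  # same speakers in language 2
same = match_face_to_voice(face, voices_lang_a) == match_face_to_voice(face, voices_lang_b)
```

Here `same` is True, mimicking a language-independent association; on real embeddings this agreement is what the experiments measure.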
Details
ISBN: (Print) 9781665448994
Few-shot learning (FSL) approaches, mostly neural network-based, assume that pre-trained knowledge can be obtained from base (seen) categories and transferred to novel (unseen) categories. However, the black-box nature of neural networks makes it difficult to understand what is actually transferred, which may hamper their application in risk-sensitive areas. In this paper, we present a new way to perform explainable FSL for image classification, using discriminative patterns and pairwise matching. Experimental results show that the proposed method achieves satisfactory explainability on two mainstream datasets. Code is available*.
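The pairwise-matching idea can be illustrated with a toy sketch: classify a query by matching its local patterns against each class's discriminative patterns, and return the matched pairs as the explanation. Everything below (the distance, the patterns) is a hypothetical simplification, not the paper's learned representation.

```python
# Toy sketch of explainable classification via pairwise pattern matching.
# Patterns are plain feature vectors standing in for discriminative patches.

def pattern_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def classify_with_explanation(query_patterns, class_patterns):
    best_class, best_score, best_pairs = None, float("inf"), []
    for cls, patterns in class_patterns.items():
        # Match each query pattern to its nearest class pattern.
        pairs = [(qp, min(patterns, key=lambda cp: pattern_distance(qp, cp)))
                 for qp in query_patterns]
        score = sum(pattern_distance(qp, cp) for qp, cp in pairs)
        if score < best_score:
            best_class, best_score, best_pairs = cls, score, pairs
    # The matched pairs are human-inspectable evidence for the decision.
    return best_class, best_pairs

query = [[1.0, 0.0], [0.9, 0.1]]
protos = {"cat": [[1.0, 0.0]], "dog": [[0.0, 1.0]]}
label, evidence = classify_with_explanation(query, protos)  # label == "cat"
```

The point of the design is that the decision decomposes into visible pattern pairs rather than an opaque score, which is what makes the prediction auditable.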
Details
ISBN: (Print) 9781665448994
In the biometrics context, the ability to provide the reasoning behind a decision has been at the core of major research efforts. Explanations serve not only to increase trust amongst the users of a system, but also to augment the system's overall accountability and transparency. In this work, we describe a periocular recognition framework that not only performs biometric recognition but also provides visual representations of the features/regions that supported a decision. Designed in particular to explain non-match ("impostor") decisions, our solution uses adversarial generative techniques to synthesise a large set of "genuine" image pairs, from which the elements most similar to a query are retrieved. Then, assuming alignment between the query/retrieved pairs, the element-wise differences between the query and a weighted average of the retrieved elements yield a visual explanation of the regions in the query pair that would have to change to turn it into a "genuine" pair. Our quantitative and qualitative experiments validate the proposed solution, yielding recognition rates similar to the state of the art while, most importantly, also providing visual explanations for every decision.
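The explanation step described above reduces to a weighted average and an element-wise difference. The sketch below illustrates it on flat toy vectors standing in for aligned periocular images; the retrieval and generative synthesis stages are assumed to have already produced the "genuine" neighbours.

```python
# Toy sketch of the non-match explanation: large entries in the difference
# map mark regions the query would have to change to become "genuine".

def weighted_average(retrieved, weights):
    total = sum(weights)
    return [sum(w * img[i] for w, img in zip(weights, retrieved)) / total
            for i in range(len(retrieved[0]))]

def explanation_map(query, retrieved, weights):
    avg = weighted_average(retrieved, weights)
    return [abs(q - a) for q, a in zip(query, avg)]

query = [0.2, 0.8, 0.5]                          # 3 "pixels"
retrieved = [[0.2, 0.1, 0.5], [0.2, 0.3, 0.5]]   # two similar genuine pairs
heat = explanation_map(query, retrieved, weights=[0.5, 0.5])
# The middle element differs most, so it drives the non-match decision.
```

On real images the same computation yields a heat map over pixels, which is the visual explanation reported in the paper.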
The stomatopod (mantis shrimp) visual system has recently provided a blueprint for the design of paradigm-shifting polarization and multispectral imaging sensors, enabling solutions to challenging medical and remote s...
Details
Details
ISBN: (Print) 9781665448994
Deep Neural Networks are brittle in that small changes in the input can drastically affect their prediction outcome and confidence. Consequently, research in this area has mainly focused on adversarial attacks and defenses. In this paper, we take an alternative stance and introduce the concept of Assistive Signals: perturbations optimized to improve a model's confidence score regardless of whether it is under attack. We analyze some interesting properties of these assistive perturbations and extend the idea to optimizing them in 3D space, simulating different lighting conditions and viewing angles. Experimental evaluations show that the assistive signals generated by our optimization method increase the accuracy and confidence of deep models more than those generated by conventional methods that operate in 2D space. Assistive Signals also reveal the bias of ML models towards certain patterns in real-life objects.
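The core idea, a perturbation optimized to raise confidence rather than lower it, can be sketched with a toy model and a simple random search. This is a hedged illustration only: the scoring function is fabricated, and the paper's method optimizes in a rendered 3D scene, not with this search.

```python
import random

# Toy sketch of an "assistive signal": a bounded input perturbation that
# increases a model's confidence. Random search stands in for the paper's
# optimizer; confidence() is a made-up stand-in for a real model.

def confidence(x):
    # Toy model: confidence peaks when the input matches a preferred pattern.
    target = [1.0, 0.0, 1.0]
    return -sum((a - b) ** 2 for a, b in zip(x, target))

def assistive_signal(x, budget=0.3, steps=500, seed=0):
    rng = random.Random(seed)
    delta = [0.0] * len(x)
    best = confidence(x)
    for _ in range(steps):
        cand = [max(-budget, min(budget, d + rng.uniform(-0.05, 0.05)))
                for d in delta]
        score = confidence([a + b for a, b in zip(x, cand)])
        if score > best:          # keep only perturbations that help
            best, delta = score, cand
    return delta, best

x = [0.8, 0.2, 0.7]
delta, boosted = assistive_signal(x)  # boosted >= confidence(x) by construction
```

The contrast with adversarial attacks is only the sign of the objective: the same bounded-perturbation machinery maximizes confidence instead of minimizing it.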
Details
ISBN: (Print) 9781665448994
Neural network quantization achieves high compression rates using a fixed low bit-width representation of weights and activations while maintaining the accuracy of the high-precision original network. Mixed-precision (per-layer bit-width) quantization, however, requires careful tuning to maintain accuracy while achieving further compression and higher granularity than fixed-precision quantization. Previous mixed-precision methods either rely on expensive search techniques such as reinforcement learning (RL) or on end-to-end optimization that offers little interpretation of the resulting quantization configuration. We propose an accuracy-aware criterion to rank layers by importance. Our method applies imprinting per layer, which acts as an efficient proxy module for accuracy estimation. We rank the layers based on the accuracy gain over previous modules and iteratively quantize first those with the smallest accuracy gain. Our method is a one-shot, efficient, accuracy-aware estimation and thus yields better interpretability of the selected bit-width configuration.
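Once per-layer accuracy gains are available, the bit-width assignment itself is a simple ranking. The sketch below assumes the gains have already been estimated (in the paper, via imprinting proxies); the layer names, gain values, and bit-widths are illustrative only.

```python
# Toy sketch of accuracy-gain-based mixed-precision assignment: layers
# contributing the least accuracy gain are quantized to low precision first.

def assign_bitwidths(accuracy_gain, low_bits=4, high_bits=8, low_fraction=0.5):
    order = sorted(accuracy_gain, key=accuracy_gain.get)   # least gain first
    n_low = int(len(order) * low_fraction)
    return {layer: (low_bits if i < n_low else high_bits)
            for i, layer in enumerate(order)}

gains = {"conv1": 0.20, "conv2": 0.02, "conv3": 0.05, "fc": 0.30}
config = assign_bitwidths(gains)
# conv2 and conv3 contribute least accuracy gain -> 4 bits;
# conv1 and fc are kept at 8 bits.
```

Because the ranking is explicit, the resulting configuration is directly inspectable: each bit-width choice traces back to a measured accuracy-gain number, which is the interpretability claim of the abstract.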
Details
ISBN: (Print) 9781665448994
There are many labelled datasets relating to land cover and crop type mapping that cover diverse geographies, agroecologies and land uses. However, these labels are often extremely sparse, particularly in low- and middle-income regions, with as few as tens of examples for certain crop types. This makes it challenging to train supervised machine learning models to detect specific crops in satellite observations of these regions. We investigate the utility of model-agnostic meta-learning (MAML) to learn from diverse global datasets and improve performance in data-sparse regions. We find that in a variety of countries (Togo, Kenya and Brazil) and across a variety of tasks (crop type mapping, crop vs. non-crop mapping), MAML improves performance compared to pretrained and random initial weights. We also investigate the utility of MAML for different target data-size regimes. We find MAML outperforms other methods for a wide range of training set sizes and positive to negative label ratios, indicating its general suitability for land use and crop type mapping.
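The MAML algorithm the abstract relies on can be shown on a toy 1-D regression family: each "task" fits its own slope, and the meta-parameter is a shared initialisation adapted by one inner gradient step per task. This sketch is first-order MAML on fabricated tasks and has no relation to the paper's satellite data beyond the algorithm itself.

```python
# Toy sketch of (first-order) MAML: each task is fitting y = w * x for its
# own slope w; the meta-update moves a shared initialisation so that one
# inner gradient step adapts well to every task.

def task_grad(w, task_w, xs=(1.0, 2.0)):
    # Gradient of the mean squared error of y = w*x against y = task_w*x.
    return sum(2 * (w * x - task_w * x) * x for x in xs) / len(xs)

def maml(tasks, meta_w=0.0, inner_lr=0.1, outer_lr=0.05, epochs=100):
    for _ in range(epochs):
        meta_grad = 0.0
        for task_w in tasks:
            adapted = meta_w - inner_lr * task_grad(meta_w, task_w)  # inner step
            # First-order approximation: outer gradient at the adapted params.
            meta_grad += task_grad(adapted, task_w)
        meta_w -= outer_lr * meta_grad / len(tasks)                  # outer step
    return meta_w

init = maml(tasks=[0.5, 1.5])  # converges near 1.0, the tasks' mean slope
```

The learned initialisation sits where one gradient step reaches any task quickly, which is exactly the property the paper exploits for data-sparse crop-mapping regions.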
Details
ISBN: (Print) 9781665448994
Depth guided any-to-any image relighting aims to generate a relit image from an original image and its depth map so as to match the illumination setting of a given guide image and its depth map. To the best of our knowledge, this task is a new challenge that has not been addressed in the previous literature. To address it, we propose a deep encoder-decoder network with a single-stream structure, called S3Net, for depth guided image relighting. We concatenate all input images and their corresponding depth maps and feed them into the model. The decoder contains an attention module and an enhancement module that focus on the relighting-related regions of the guide images. Experiments on a challenging benchmark show that the proposed model achieves the 3rd-highest SSIM in the NTIRE 2021 Depth Guided Any-to-any Relighting Challenge.
Details
ISBN: (Digital) 9781665469463
ISBN: (Print) 9781665469463
The objective of this paper is a temporal alignment network that ingests long-term video sequences and associated text sentences in order to: (1) determine if a sentence is alignable with the video; and (2) if it is alignable, determine its alignment. The challenge is to train such networks from large-scale datasets, such as HowTo100M, where the associated text sentences have significant noise and are only weakly aligned when relevant. Apart from proposing the alignment network, we make four contributions: (i) we describe a novel co-training method that makes it possible to denoise and train on raw instructional videos without manual annotation, despite the considerable noise; (ii) to benchmark alignment performance, we manually curate a 10-hour subset of HowTo100M, totalling 80 videos, with sparse temporal descriptions; our proposed model, trained on HowTo100M, outperforms strong baselines (CLIP, MIL-NCE) on this alignment dataset by a significant margin; (iii) we apply the trained model in zero-shot settings to multiple downstream video understanding tasks and achieve state-of-the-art results, including text-video retrieval on YouCook2 and weakly supervised video action segmentation on Breakfast-Action; and (iv) we use the automatically aligned HowTo100M annotations for end-to-end finetuning of the backbone model and obtain improved performance on downstream action recognition tasks.
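The two decisions the network makes, alignability and alignment, can be sketched with toy embeddings: score a sentence against per-second video features, declare it alignable if the peak windowed similarity clears a threshold, and align it to the best-scoring window. The embeddings, window size, and threshold below are all fabricated for illustration.

```python
# Toy sketch of the alignability + alignment decisions over a video
# represented as per-second feature vectors.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def align_sentence(sent_emb, video_embs, window=2, threshold=1.0):
    scores = [sum(dot(sent_emb, video_embs[t + k]) for k in range(window))
              for t in range(len(video_embs) - window + 1)]
    best = max(scores)
    if best < threshold:
        return None                       # decision (1): not alignable
    start = scores.index(best)
    return (start, start + window)        # decision (2): aligned window (seconds)

video = [[0.1, 0.0], [0.9, 0.1], [0.8, 0.0], [0.0, 0.9]]
span = align_sentence([1.0, 0.0], video)       # aligns to seconds 1-3
miss = align_sentence([0.0, -1.0], video)      # below threshold -> None
```

Separating the alignable/not-alignable decision from localization is what lets the model cope with HowTo100M narration that is often irrelevant to the visible content.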