ISBN (print): 9781450366151
Facial expressions carry essential cues for inferring a human's state of mind, conveying adequate information to understand an individual's actual feelings. Thus, automatic facial expression recognition is an interesting and crucial task for interpreting human cognitive states through a machine. In this paper, we propose an Exigent Features Preservative Network (EXPERTNet) to describe the features of facial expressions. EXPERTNet extracts only pertinent features and neglects others by using an exigent feature (ExFeat) block, which mainly comprises an elective layer. Specifically, the elective layer selects the desired edge-variation features from the outputs of the previous layer, which are generated by applying filters of different sizes: 1 x 1, 3 x 3, 5 x 5 and 7 x 7. Filters of different sizes help elicit both micro- and high-level features, which enhances the learnability of the neurons. The ExFeat block preserves the spatial structural information of the facial expression, which allows discrimination between different classes of facial expressions. Visual representations of the proposed method over different facial expressions show the learning capability of neurons in different layers. Experimental and comparative analyses over four comprehensive datasets: CK+, MMI, DISFA and GEMEP-FERA, demonstrate the better performance of the proposed network compared to existing networks.
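The abstract does not specify how the elective layer selects features among the multi-sized filter branches. A minimal NumPy sketch, assuming the selection keeps, per spatial location, the strongest-magnitude response across the 1x1/3x3/5x5/7x7 branches (the function names and random edge-like kernels are illustrative assumptions, not the paper's actual layer):

```python
import numpy as np

def conv2d_same(image, kernel):
    """Naive 'same'-padded 2D cross-correlation in plain NumPy."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)), mode="constant")
    out = np.zeros_like(image, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

def exfeat_elective(image, kernel_sizes=(1, 3, 5, 7), rng=None):
    """Multi-scale filtering followed by an 'elective' selection:
    keep, per pixel, the strongest response among the branches."""
    rng = rng or np.random.default_rng(0)
    responses = []
    for k in kernel_sizes:
        kernel = rng.standard_normal((k, k))
        kernel -= kernel.mean()  # zero-mean kernel acts as an edge filter
        responses.append(conv2d_same(image, kernel))
    stack = np.stack(responses)                   # (branches, H, W)
    idx = np.abs(stack).argmax(axis=0)            # winning branch per pixel
    rows = np.arange(image.shape[0])[:, None]
    cols = np.arange(image.shape[1])[None, :]
    return stack[idx, rows, cols]
```

A per-pixel max over branches is only one plausible reading of "selects the desired edge variation features"; a learned gating over the branch outputs would be an equally valid interpretation.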
Describing the contents of an image automatically has been a fundamental problem in the fields of artificial intelligence and computer vision. Existing approaches are either top-down, starting from a simple representation of an image and converting it into a textual description; bottom-up, coming up with attributes describing numerous aspects of an image to form the caption; or a combination of both. Recurrent neural networks (RNNs) enhanced by Long Short-Term Memory (LSTM) units have become a dominant component of several frameworks designed for solving the image captioning task. Despite their ability to reduce the vanishing gradient problem and capture dependencies, they are inherently sequential across time. In this work, we propose two novel approaches, a top-down and a bottom-up approach independently, which dispense with recurrence entirely by incorporating a Transformer, a network architecture for generating sequences that relies entirely on the mechanism of attention. Adaptive positional encodings for the spatial locations in an image and a new regularization cost during training are introduced. The ability of our model to focus automatically on salient regions in the image is demonstrated visually. Experimental evaluation of the proposed architecture on the MS-COCO dataset is performed to exhibit the superiority of our method.
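The attention mechanism the Transformer relies on is scaled dot-product attention. A self-contained NumPy sketch of that standard operation (this illustrates the general mechanism, not this paper's specific adaptive positional encodings or regularization cost):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (n_q, d_k) queries, K: (n_k, d_k) keys, V: (n_k, d_v) values.
    Returns the attended values and the attention weights.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # similarity, (n_q, n_k)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights
```

Because every query attends to every key in a single matrix product, there is no sequential dependence across positions, which is the property that lets the proposed approaches dispense with recurrence.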
In this paper, we propose a word segmentation method based on fringe maps for Telugu script. Our objective is to create a dataset of word images to enable direct training for recognition on those. The stand...
This paper presents an online handwritten benchmark dataset (OHWR-Gurmukhi) for Gurmukhi script. TIET, Patiala released the unconstrained online handwriting databases, OHWR-GNumerals and OHWR-GScript, which contain is...
The rapid advances in mobile and networking technologies have resulted in the use of mobiles for critical applications like m-commerce, m-payments, etc. Even though mobile-based services offer many benefits, authenticating the...
Conventional convolutional neural networks (CNNs) are trained on large domain datasets and are hence typically over-represented and inefficient in limited-class applications. An efficient way to convert such large many-class pre-trained networks into small few-class networks is through a hierarchical decomposition of their feature maps. To this end, we propose an automated framework for such decomposition, the Hierarchically Self Decomposing CNN (HSD-CNN), in four steps. HSD-CNN is derived automatically using a class-specific filter sensitivity analysis that quantifies the impact of specific features on a class prediction. The decomposed hierarchical network can be utilized and deployed directly to obtain sub-networks for a subset of classes, and it is shown to perform well without requiring retraining of these sub-networks. Experimental results show that HSD-CNN generally does not degrade accuracy when the full set of classes is used. Interestingly, when operating on known subsets of classes, HSD-CNN improves accuracy with a much smaller model size requiring far fewer operations. The HSD-CNN flow is verified on the CIFAR10, CIFAR100 and CALTECH101 datasets. We report accuracies up to 85.6% (94.75%) on scenarios with 13 (4) classes of CIFAR100, using a VGG-16 network pre-trained on the full dataset. In this case, the proposed HSD-CNN requires 3.97x fewer parameters and achieves 71.22% savings in operations, in comparison to the baseline VGG-16 containing features for all 100 classes.
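The abstract does not define the sensitivity measure or the selection rule. A minimal NumPy sketch of one plausible reading: score each filter by its mean activation magnitude per class, then keep the filters most relevant to a chosen class subset (function names, the max-over-subset aggregation, and the keep ratio are all illustrative assumptions):

```python
import numpy as np

def class_filter_sensitivity(activations, labels, n_classes):
    """activations: (n_samples, n_filters) mean feature-map responses.
    Returns a (n_classes, n_filters) table of mean |activation| per class,
    a simple proxy for how much each filter matters to each class."""
    sens = np.zeros((n_classes, activations.shape[1]))
    for c in range(n_classes):
        sens[c] = np.abs(activations[labels == c]).mean(axis=0)
    return sens

def select_filters(sens, class_subset, keep_ratio=0.5):
    """Keep the filters most sensitive to any class in the subset,
    yielding the filter indices for a pruned sub-network."""
    score = sens[list(class_subset)].max(axis=0)   # relevance to the subset
    k = max(1, int(keep_ratio * len(score)))
    return np.sort(np.argsort(score)[::-1][:k])    # sorted kept indices
```

Applied layer by layer, such a selection yields sub-networks specialized for a class subset without retraining, which is the deployment scenario the abstract describes.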
In this work, we propose a computationally efficient compressive-sensing-based approach for very low bit rate lossy coding of hyperspectral (HS) image data, exploiting the redundancy inherent in this imaging modality. We divide the HS datacube into subsets of adjacent bands, each of which is encoded into a coded snapshot using a random code matrix. These coded snapshot images are compressed using the wavelet-based SPIHT technique. Decompression from the coded snapshots at the receiver is done using orthogonal matching pursuit with the help of an overcomplete dictionary learned on a general-purpose training dataset. We provide ample experimental results and performance comparisons to substantiate the usefulness of the proposed method. In the proposed technique the encoder is independent of the decoder, and it offers a significant saving in computation while yielding a much higher compression quality.
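The coded-snapshot step above can be sketched in a few lines: each band in a subset is multiplied element-wise by a random code and the results are summed into a single 2D image. This NumPy sketch assumes a binary code per band (the paper's actual code matrix may be real-valued or structured differently):

```python
import numpy as np

def coded_snapshot(bands, rng=None):
    """Collapse a subset of adjacent HS bands (B, H, W) into one coded
    snapshot: y = sum_b C_b * x_b, with a random {0,1} code per band.

    Returns the snapshot (H, W) and the codes (B, H, W) the decoder
    would need for sparse recovery.
    """
    rng = rng or np.random.default_rng(0)
    codes = rng.integers(0, 2, size=bands.shape).astype(float)
    return (codes * bands).sum(axis=0), codes
```

The cheapness of this forward step (per-pixel multiplies and adds) is what keeps the encoder light; the expensive sparse recovery with orthogonal matching pursuit happens only at the receiver.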
Egocentric activity recognition (EAR) is an emerging area in the field of computer vision research. Motivated by the current success of convolutional neural networks (CNNs), we propose a multi-stream CNN for multimodal egocentric activity recognition using visual (RGB videos) and sensor streams (accelerometer, gyroscope, etc.). In order to effectively capture the spatio-temporal information contained in RGB videos, two types of modalities are extracted from the visual data: Approximate Dynamic Image (ADI) and Stacked Difference Image (SDI). These image-based representations are generated at both the clip level and the entire-video level, and are then utilized to fine-tune a pretrained 2D-CNN, MobileNet, which is specifically designed for mobile vision applications. Similarly, for sensor data, each training sample is divided into three segments, and a deep 1D-CNN network is trained from scratch for each type of sensor stream. During testing, the softmax scores of all the streams (visual + sensor) are combined by late fusion. The experiments performed on a multimodal egocentric activity dataset demonstrate that our proposed approach achieves state-of-the-art results, outperforming the current best handcrafted and deep learning based techniques.
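The late-fusion step described above can be sketched directly: convert each stream's logits to softmax scores and combine them. This NumPy sketch assumes a simple unweighted average over streams (the paper may use a weighted or other fusion rule):

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def late_fusion(stream_logits):
    """Average the per-stream softmax scores (visual + sensor streams)
    and predict the activity class with the highest fused score."""
    probs = np.mean([softmax(l) for l in stream_logits], axis=0)
    return probs.argmax(axis=-1), probs
```

Late fusion keeps each stream's network independent, so a visual stream and several 1D sensor streams can be trained separately and only their class scores need to agree at test time.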
In practice, images can contain different amounts of noise in different color channels, which is not acknowledged by existing super-resolution approaches. In this paper, we propose to super-resolve noisy color images by considering the color channels jointly. Noise statistics are blindly estimated from the input low-resolution image and are used to assign different weights to different color channels in the data cost. The implicit low-rank structure of visual data is enforced via nuclear norm minimization in association with adaptive weights, added as a regularization term to the cost. Additionally, multi-scale details of the image are added to the model through another regularization term that involves projection onto a PCA basis, constructed using similar patches extracted across different scales of the input image. The results demonstrate the super-resolving capability of the approach in real scenarios.
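The per-channel weighting of the data cost can be sketched as: blindly estimate each channel's noise level, then weight channels inversely to their noise variance. This NumPy sketch uses a median-absolute-deviation estimator on horizontal differences as the blind estimate; the paper's actual estimator and weighting scheme may differ:

```python
import numpy as np

def estimate_noise_sigma(channel):
    """Rough blind noise estimate: MAD of horizontal pixel differences.
    Differencing suppresses smooth image content; the sqrt(2) corrects
    for the variance doubling of the difference of two noisy pixels."""
    d = np.diff(channel, axis=-1).ravel()
    return 1.4826 * np.median(np.abs(d - np.median(d))) / np.sqrt(2)

def channel_weights(image):
    """Assign data-cost weights per color channel, w_c proportional to
    1 / sigma_c^2, so noisier channels influence the fit less."""
    sigmas = np.array([estimate_noise_sigma(image[..., c])
                       for c in range(image.shape[-1])])
    w = 1.0 / np.maximum(sigmas, 1e-8) ** 2
    return w / w.sum()
```

The 1/sigma^2 form follows the usual inverse-variance weighting of a least-squares data term; it is a natural reading of "assign different weights to different color channels", not necessarily the paper's exact rule.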
Face Recognition (FR) under adversarial conditions has been a big challenge for researchers in the computer vision and machine learning communities in the recent past. Most state-of-the-art face recognition systems have been designed to overcome degradations in a face due to variations in pose, illumination, contrast and resolution, along with blur. However, interestingly, none has addressed the fascinating issue of makeup as a spoof attack, which drastically changes the appearance of a face, making it difficult even for humans to detect and identify the impostor. In this paper, we propose a novel multi-component deep convolutional neural network (CNN) based architecture which performs the complex task of makeup removal from a disguised face, to reveal the original mugshot image of the impostor (i.e. without makeup). The proposed network also performs the hard tasks of FR on a disguised face, recognition of identity, and generation of the face of the spoofed target, by minimizing a novel multi-component objective function. Comparison with a few recent state-of-the-art FR methods over three benchmark datasets reveals the superiority of our proposed method for both synthesis and recognition (FR) tasks.
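A multi-component objective of the kind described above is typically a weighted sum of per-task losses. This NumPy sketch combines three plausible components matching the abstract's three tasks; the component losses, their weights, and all names here are illustrative assumptions, not the paper's actual objective:

```python
import numpy as np

def l1_loss(pred, target):
    """Mean absolute error, a common choice for image synthesis terms."""
    return np.abs(pred - target).mean()

def cross_entropy(probs, label):
    """Negative log-likelihood of the true identity class."""
    return -np.log(probs[label] + 1e-12)

def multi_component_loss(demakeup, mugshot, id_probs, id_label,
                         gen_target, true_target,
                         lambdas=(1.0, 1.0, 1.0)):
    """Hypothetical weighted sum over the abstract's three components:
    makeup removal, identity recognition, and target-face generation."""
    l_removal, l_identity, l_generation = lambdas
    return (l_removal * l1_loss(demakeup, mugshot)
            + l_identity * cross_entropy(id_probs, id_label)
            + l_generation * l1_loss(gen_target, true_target))
```

Minimizing such a sum trains the shared network to satisfy all three tasks jointly, with the lambda weights trading off synthesis quality against recognition accuracy.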