ISBN (digital): 9781728125060
ISBN (print): 9781728125060
Evolutionary deep intelligence has recently shown great promise for producing small, powerful deep neural network models via the synthesis of increasingly efficient architectures over successive generations. Despite recent research showing the efficacy of multi-parent evolutionary synthesis, little has been done to directly assess architectural similarity between networks during the synthesis process for improved parent network selection. In this work, we present a preliminary study into quantifying architectural similarity via the percentage overlap of architectural clusters. Results show that networks synthesized using architectural alignment (via gene tagging) maintain higher architectural similarities within each generation, potentially restricting the search space of highly efficient network architectures.
Anticipating an action that is about to happen allows us to be more efficient in interacting with our environment. However, prediction is a challenging task in computer vision, because videos are only partially available when a decision is to be made. Complicating the issue is that it is not always clear which of the visible activities in the scene are relevant to the action, and which ones are not. We suggest that the key to recognizing an action lies with the human actors, and that it is therefore necessary for the prediction process to attend to persons in a scene. In our work, we extract fine-grained features on visible human actors and predict the future via an L2-regression in feature space. This allows the regressed future feature to focus on the actor. Using this, the future action is classified. More specifically, the fine-grained extraction is guided by a pose prediction system that models current and future human poses in the scene. We run qualitative and quantitative experiments on the Charades dataset, and initial results show that our system improves action prediction.
Infrared (IR) images are characterized by a lower sensitivity to lighting conditions than the visible spectrum. This opens the door to relatively untapped research potential of automatic recognition systems that are robust to shadows and variability in illumination levels or appearance. IR action recognition (AR) is one such application. It remains a fairly unexplored domain in IR. As such, in this paper, we propose the use of hidden Markov models (HMM) for IR AR. We also derive the mathematical model for the variational learning of Beta-Liouville (BL) HMMs. Next, we present the results of the proposed model on the Infrared Action Recognition (InfAR) dataset. To the best of our knowledge, this is the first application of HMMs to AR in the IR domain, and the first application of the BL HMMs to AR. Experiments with different features extracted from the InfAR dataset show promising results.
We propose the Disentangled Representation-learning Wasserstein GAN (DR-WGAN) trained on augmented data for face recognition and face synthesis across pose. We improve the state-of-the-art DR-GAN with the Wasserstein loss considered in the discriminator so that the generative and adversarial framework can be better trained. The improved training leads to better face disentanglement and synthesis. We also highlight the influence of imbalanced training data on disentangled facial representation learning, and point out the difficulty of generating faces with extreme poses. We explore the recently proposed nonlinear 3D Morphable Model (3DMM) to augment the training data, and verify the contribution made by learning on augmented data. We also compare different data normalization schemes and reveal the benefit of using group normalization. The proposed framework is verified through experiments on benchmark databases, and compared with contemporary approaches for performance evaluation.
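As a rough illustration of the Wasserstein loss mentioned above, the following minimal sketch computes the standard WGAN critic and generator losses from lists of discriminator scores. The function names are hypothetical, and the abstract does not say how DR-WGAN constrains the discriminator (for example by weight clipping or a gradient penalty), so that part is left out here.

```python
def critic_loss(real_scores, fake_scores):
    """Standard WGAN critic loss: mean score on fakes minus mean score on reals.

    Minimizing this pushes scores on real samples up and scores on
    generated samples down. (Hypothetical helper, not the paper's code.)
    """
    mean_real = sum(real_scores) / len(real_scores)
    mean_fake = sum(fake_scores) / len(fake_scores)
    return mean_fake - mean_real


def generator_loss(fake_scores):
    """Standard WGAN generator loss: negative mean critic score on fakes."""
    return -sum(fake_scores) / len(fake_scores)


if __name__ == "__main__":
    real = [1.0, 3.0]   # critic outputs on real face images (made-up values)
    fake = [0.0, 1.0]   # critic outputs on synthesized faces (made-up values)
    print(critic_loss(real, fake))
    print(generator_loss(fake))
```

Unlike the log-based loss of the original GAN, these scores are unbounded real values rather than probabilities, which is why the critic needs some separate constraint in practice.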
This paper presents a new algorithm for enforcing temporal coherence on depth estimation from multi-view videos of dynamic scenes as well as the first substantial quantitative evaluation of the improvement in depth estimation accuracy due to temporal coherence. The proposed algorithm is generally applicable and practical since it bypasses explicit scene flow estimation, which has a very large state space, and relies only on optical flow which is used to impose soft constraints on depth estimation for the next frame. As a result, our algorithm is applicable to scenes with large depth and motion ranges. The output is a sequence of depth maps that can be used for novel view synthesis among other applications. While it is intuitive that enforcing temporal coherence should improve the accuracy of depth estimation, this improvement has never been assessed quantitatively due to the lack of data with ground truth. To overcome this limitation we use the image prediction error as the criterion and show that the benefits of temporal coherence are significant on a diverse set of multi-view video sequences.
We present a new end-to-end network architecture for facial expression recognition with an attention model. It focuses attention on the human face and uses a Gaussian space representation for expression recognition. We devise this architecture based on two fundamental complementary components: (1) facial image correction and attention and (2) facial expression representation and classification. The first component uses an encoder-decoder style network and a convolutional feature extractor that are pixel-wise multiplied to obtain a feature attention map. The second component is responsible for obtaining an embedded representation and classification of the facial expression. We propose a loss function that creates a Gaussian structure on the representation space. To demonstrate the proposed method, we create two larger and more comprehensive synthetic datasets using the traditional BU3DFE and CK+ facial datasets. We compare results with the Pre-ActResNet18 baseline. Our experiments on these datasets show the superiority of our approach in recognizing facial expressions.
In this paper, we propose a three-stream convolutional neural network (3SCNN) for action recognition from skeleton sequences, which aims to fully exploit the skeleton data by extracting, learning, fusing and inferring multiple motion-related features, including 3D joint positions and joint displacements across adjacent frames as well as oriented bone segments. The proposed 3SCNN involves three sequential stages. The first stage enriches three independently extracted features by co-occurrence feature learning. The second stage involves multi-channel pairwise fusion to take advantage of the complementary and diverse nature of the three features. The third stage is a multi-task and ensemble learning network to further improve the generalization ability of 3SCNN. Experimental results on the standard dataset show the effectiveness of our proposed multi-stream feature learning, fusion and inference method for skeleton-based 3D action recognition.
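The two derived input features named above can be sketched as follows: joint displacements as per-joint differences between adjacent frames, and oriented bone segments as parent-to-child vectors. This is only a sketch under assumptions, since the abstract does not specify the data layout, normalization, or skeleton topology; the function names and the `bones` index pairs are hypothetical.

```python
def joint_displacements(frames):
    """Per-joint 3D displacement between each pair of adjacent frames.

    frames: list of frames, each a list of (x, y, z) joint positions.
    Returns len(frames) - 1 lists of displacement vectors.
    """
    return [
        [tuple(c1 - c0 for c0, c1 in zip(j0, j1)) for j0, j1 in zip(f0, f1)]
        for f0, f1 in zip(frames, frames[1:])
    ]


def bone_segments(frame, bones):
    """Oriented bone vectors for one frame.

    bones: list of (parent_index, child_index) pairs (assumed topology).
    Each vector points from the parent joint to the child joint.
    """
    return [
        tuple(c - p for p, c in zip(frame[parent], frame[child]))
        for parent, child in bones
    ]


if __name__ == "__main__":
    # Two frames of a toy two-joint skeleton (made-up coordinates).
    frames = [
        [(0, 0, 0), (1, 0, 0)],
        [(0, 1, 0), (1, 2, 0)],
    ]
    print(joint_displacements(frames))
    print(bone_segments(frames[0], bones=[(0, 1)]))
```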
Recent work shows unequal performance of commercial face classification services in the gender classification task across intersectional groups defined by skin type and gender. Accuracy on dark-skinned females is significantly worse than on any other group. We provide initial evidence that skin type alone is not the driver for this disparity by conducting novel stability experiments that vary an image's skin type via color-theoretic methods, namely luminance mode-shift and optimal transport. We evaluate the effect of skin type change on the gender classification decision of a pair of state-of-the-art commercial and open-source gender classifiers. The results raise the possibility that broader *** in ethnicity, as opposed to the skin type alone, are what contribute to unequal gender classification accuracy in face images.
Most multiple people tracking systems compute trajectories based on the tracking-by-detection paradigm. Consequently, the performance depends to a large extent on the quality of the employed input detections. However, despite enormous progress in recent years, partially occluded people are still often not recognized. Also, many correct detections are mistakenly discarded when non-maximum suppression is performed. Improving the tracking performance thus requires augmenting the coarse input. Fine-grained body joint detections are well suited for this task, as they make it possible to locate even strongly occluded persons. In this work, we therefore analyze the suitability of including joint detections for multiple people tracking. We introduce different affinities between the two detection types and evaluate their performance. Tracking is then performed within a near-online framework based on a min-cost graph labeling formulation. As a result, our framework can recover heavily occluded persons and solve the data association efficiently. We evaluate our framework on the MOT16/17 benchmark. Experimental results demonstrate that our framework achieves state-of-the-art results.
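To make concrete how non-maximum suppression can discard a correct detection, here is a minimal sketch of the common greedy, IoU-based variant. The abstract does not say which variant or threshold the detectors use, so the `(x1, y1, x2, y2)` box format, the function names, and the 0.5 threshold are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it.

    Returns the indices of the kept boxes, in descending score order.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap any kept box too strongly.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    # Boxes 0 and 1 overlap heavily, as two people standing close might.
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(nms(boxes, scores))
```

In the toy input, the second box is suppressed purely because it overlaps the first, even though it could belong to a different, partially occluded person.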
Automated facial expression classification has widespread application in multiple domains such as human-computer interaction, health and entertainment, biometrics, and security. There are six basic facial expressions: Anger, Disgust, Fear, Happiness, Sadness, and Surprise, apart from a neutral state. Most of the research in expression classification has focused on adult face images, with no dedicated research on automating expression classification for children. To the best of our knowledge, this is the first research which presents a deep-learning-based expression classification approach for children. A novel supervised deep learning formulation, termed Mean Supervised Deep Boltzmann Machine (msDBM), is proposed, which classifies an input face image into one of the seven expression classes. The proposed approach has been evaluated on two child face datasets, Radboud Faces and CAFE, along with experiments on the adult face images of the Radboud Faces dataset. Experimental results and analysis reinforce the challenging nature of the task at hand, and the effectiveness of the proposed msDBM model.