ISBN: (print) 9781509014378
In this paper, a new skeleton-based approach is proposed for 3D hand gesture recognition. Specifically, we exploit the geometric shape of the hand to extract an effective descriptor from the connected joints of the hand skeleton returned by the Intel RealSense depth camera. Each descriptor is then encoded by a Fisher Vector representation obtained using a Gaussian Mixture Model. A temporal pyramid provides a multi-level representation of the Fisher Vectors and other skeleton-based geometric features, yielding the final feature vector, which is then classified by a linear SVM. The proposed approach is evaluated on a challenging hand gesture dataset containing 14 gestures performed by 20 participants, each gesture being performed with two different numbers of fingers. Experimental results show that our skeleton-based approach consistently achieves superior performance over a depth-based approach.
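The temporal pyramid mentioned in this abstract can be illustrated with a minimal sketch. It assumes simple mean pooling as a stand-in for the Fisher Vector encoding, and the function names (`mean_pool`, `temporal_pyramid`) and the choice of three levels are hypothetical, not taken from the paper.

```python
from typing import List, Sequence


def mean_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average equal-length per-frame descriptors.

    Assumption: this stands in for the Fisher Vector encoding of a segment.
    """
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def temporal_pyramid(frames: Sequence[Sequence[float]], levels: int = 3) -> List[float]:
    """Concatenate pooled descriptors over 1, 2, 4, ... equal temporal segments."""
    n = len(frames)
    feature: List[float] = []
    for level in range(levels):
        parts = 2 ** level  # number of segments at this pyramid level
        for p in range(parts):
            start = p * n // parts
            end = (p + 1) * n // parts
            segment = list(frames[start:end])
            if not segment:
                # Very short sequences: fall back to the nearest single frame.
                segment = [frames[min(start, n - 1)]]
            feature.extend(mean_pool(segment))
    return feature


if __name__ == "__main__":
    sequence = [[0.0], [2.0], [4.0], [6.0]]
    print(temporal_pyramid(sequence))
```

With three levels the final vector holds 1 + 2 + 4 = 7 pooled descriptors, so coarse levels summarize the whole gesture while finer levels keep its temporal order.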
In this paper, we introduce a novel framework for video-based action recognition, which incorporates sequential information with spatiotemporal features. Specifically, the spatiotemporal features are extracted from the sliced clips of videos, and then a recurrent neural network is applied to embed the sequential information into the final feature representation of the video. In contrast to most current deep learning methods for video-based tasks, our framework incorporates both long-term dependencies and spatiotemporal information of the clips in the video. To extract the spatiotemporal features from the clips, both dense trajectories (DT) and a newly proposed 3D neural network, C3D, are applied in our experiments. Our proposed framework is evaluated on the benchmark datasets UCF101 and HMDB51, and achieves performance comparable to state-of-the-art results.
In this paper, we propose a multimodal multi-stream deep learning framework to tackle the egocentric activity recognition problem, using both video and sensor data. First, we experiment with and extend a multi-stream Convolutional Neural Network to learn the spatial and temporal features from egocentric videos. Second, we propose a multi-stream Long Short-Term Memory architecture to learn the features from multiple sensor streams (accelerometer, gyroscope, etc.). Third, we propose to use a two-level fusion technique and experiment with different pooling techniques to compute the prediction results. Experimental results using a multimodal egocentric dataset show that our proposed method can achieve very encouraging performance, despite the constraint that the scale of the existing egocentric datasets is still quite limited.
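One way to picture a two-level fusion with interchangeable pooling is sketched below. The abstract does not specify the fusion details, so this is an assumption: level one pools class scores across the streams of a modality, level two pools across modalities. The names `pool` and `two_level_fusion` and the example scores are hypothetical.

```python
from typing import Dict, List, Tuple


def pool(score_lists: List[List[float]], mode: str = "average") -> List[float]:
    """Pool several per-class score vectors into one, class by class."""
    n_classes = len(score_lists[0])
    if mode == "average":
        return [sum(s[c] for s in score_lists) / len(score_lists) for c in range(n_classes)]
    if mode == "max":
        return [max(s[c] for s in score_lists) for c in range(n_classes)]
    raise ValueError(f"unknown pooling mode: {mode}")


def two_level_fusion(
    modalities: Dict[str, List[List[float]]], mode: str = "average"
) -> Tuple[List[float], int]:
    """Assumed scheme: pool streams within each modality, then pool modalities."""
    per_modality = [pool(streams, mode) for streams in modalities.values()]
    fused = pool(per_modality, mode)
    predicted = max(range(len(fused)), key=fused.__getitem__)
    return fused, predicted


if __name__ == "__main__":
    scores = {
        "video": [[0.5, 0.5], [0.25, 0.75]],  # e.g. spatial and temporal streams
        "sensor": [[0.0, 1.0]],               # e.g. one sensor stream
    }
    print(two_level_fusion(scores, mode="average"))
    print(two_level_fusion(scores, mode="max"))
```

Swapping `mode` is enough to compare pooling strategies on the same stream outputs.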
We propose a two-level system for apparent age estimation from facial images. Our system first classifies samples into overlapping age groups. Within each group, the apparent age is estimated with local regressors, whose outputs are then fused for the final estimate. We use a face detector based on a deformable parts model, and features from a pre-trained deep convolutional network. Kernel extreme learning machines are used for classification. We evaluate our system on the ChaLearn Looking at People 2016 - Apparent Age Estimation challenge dataset, and report a normal score of 0.3740 on the sequestered test set.
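The two-level idea can be sketched as follows. The group boundaries, the score-weighted averaging used for fusion, and the function names are all illustrative assumptions; the abstract does not state how the groups are defined or how the local estimates are fused.

```python
from typing import List, Sequence, Tuple

# Hypothetical overlapping age groups (inclusive bounds), for illustration only.
AGE_GROUPS: List[Tuple[int, int]] = [(0, 20), (15, 40), (35, 100)]


def groups_for_age(age: float, groups: Sequence[Tuple[int, int]] = AGE_GROUPS) -> List[int]:
    """Return every group index whose range contains the age.

    Because the groups overlap, a training sample near a boundary
    contributes to more than one local regressor.
    """
    return [i for i, (lo, hi) in enumerate(groups) if lo <= age <= hi]


def fuse_estimates(group_scores: Sequence[float], local_estimates: Sequence[float]) -> float:
    """Assumed fusion rule: average the local estimates, weighted by group scores."""
    total = sum(group_scores)
    if total <= 0:
        raise ValueError("group scores must sum to a positive value")
    return sum(w * e for w, e in zip(group_scores, local_estimates)) / total


if __name__ == "__main__":
    print(groups_for_age(18))  # falls into two overlapping groups
    print(fuse_estimates([0.25, 0.75, 0.0], [16.0, 20.0, 60.0]))
```

Overlap softens the effect of a first-level misclassification near a group boundary, since neighbouring regressors have both seen samples from that age range.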
Video surveillance systems generated about 65% of the Universe Big Data in 2015. The development of systems for intelligent analysis of such a large amount of data is among the most investigated topics in academia and the commercial world. Recent outcomes in knowledge management and computational intelligence demonstrate the effectiveness of semantic technologies in several fields such as image and text analysis, handwriting recognition, and speech recognition. This paper presents a solution that, starting from the output of a people tracking algorithm, is able to recognize simple events (a person falling to the ground) and complex ones (aggression against a person). The proposed solution uses semantic web technologies to automatically annotate the output produced by the tracking algorithm; a set of rules for reasoning on these annotated data is also proposed. Such rules make it possible to define complex analytics functions, demonstrating the effectiveness of hybrid approaches for event recognition.
Motivated by the success of CNNs in object recognition on images, researchers are striving to develop CNN equivalents for learning video features. However, learning video features globally has proven to be quite a challenge due to the difficulty of getting enough labels, processing large-scale video data, and representing motion information. Therefore, we propose to leverage effective techniques from both data-driven and data-independent approaches to improve action recognition systems. Our contribution is three-fold. First, we explicitly show that local handcrafted features and CNNs share the same convolution-pooling network structure. Second, we propose to use independent subspace analysis (ISA) to learn descriptors for state-of-the-art handcrafted features. Third, we enhance ISA with two new improvements, which make our learned descriptors significantly outperform the handcrafted ones. Experimental results on standard action recognition benchmarks show competitive performance.
Background subtraction is a basic problem for change detection in videos and also the first step of high-level computer vision applications. Most background subtraction methods rely on color and texture features. However, due to illumination changes across scenes and the effects of noisy pixels, those methods often produce many false positives in complex environments. To solve this problem, we propose an adaptive background subtraction model which uses a novel Local SVD Binary Pattern (LSBP) feature instead of depending only on color intensity. This feature can describe the potential structure of the local regions in a given image, and thus enhances robustness to illumination variation, noise, and shadows. We use a sample consensus model, which is well suited to our LSBP feature. Experimental results on the CDnet 2012 dataset demonstrate that our background subtraction method using the LSBP feature is more effective than many state-of-the-art methods.
Human action recognition has emerged as one of the most challenging and active areas of research in the computer vision domain. In addition to pose variation and scale variability, the high complexity of human motions and the variability of object interactions represent significant additional challenges. In this paper, we present an approach for human-object interaction modeling and classification. Towards that goal, we adopt relevant frame-level features: the inter-joint distances and the joint-object distances. These features are insensitive to position and pose variation. The evolution of these distances in time is modeled by trajectories in a high-dimensional space, and a shape analysis framework is used to model and compare the trajectories corresponding to human-object interactions in a Riemannian manifold. Experiments conducted following state-of-the-art settings demonstrate the strength of the proposed method. Using only the skeletal information, we achieve state-of-the-art classification results on the benchmark dataset.
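The frame-level features named in this abstract can be sketched in a few lines. This is a minimal illustration assuming 3D joint coordinates and a single object position per frame; the function name `frame_features` and the ordering of the output vector are hypothetical choices, not the paper's.

```python
import math
from itertools import combinations
from typing import List, Sequence

Point = Sequence[float]


def frame_features(joints: Sequence[Point], obj: Point) -> List[float]:
    """Build one frame's feature vector from pairwise distances.

    First come the Euclidean distances between every pair of joints,
    then the distance from each joint to the object position.
    """
    inter_joint = [math.dist(a, b) for a, b in combinations(joints, 2)]
    joint_object = [math.dist(j, obj) for j in joints]
    return inter_joint + joint_object


if __name__ == "__main__":
    joints = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 12.0)]
    obj = (0.0, 0.0, 0.0)
    print(frame_features(joints, obj))
```

Because only distances are used, translating the joints and the object by the same offset leaves the vector unchanged, which is consistent with the position insensitivity claimed above. Stacking these vectors over time gives the trajectory that the shape analysis then compares.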
We present a polarimetric thermal face database, the first of its kind, for face recognition research. This database was acquired using a polarimetric longwave infrared imager, specifically a division-of-time spinning achromatic retarder system. A corresponding set of visible spectrum imagery was also collected to facilitate cross-spectrum (also referred to as heterogeneous) face recognition research. The database consists of imagery acquired at three distances under two experimental conditions: a neutral/baseline condition and an expressions condition. Annotations (spatial coordinates of key fiducial points) are provided for all images. Cross-spectrum face recognition performance on the database is benchmarked using three techniques: partial least squares, deep perceptual mapping, and coupled neural networks.
In this paper, we propose a framework for recognizing human activities that uses only in-topic dominant code-words and a mixture of inter-topic vectors. Latent Dirichlet allocation (LDA) is used to develop approximations of human motion primitives; these are mid-level representations, and they adaptively integrate dominant vectors when classifying human activities. In LDA topic modeling, action videos (documents) are represented by a bag-of-words (input from a dictionary) built on improved dense trajectories [18]. The output topics correspond to human motion primitives, such as finger moving or subtle leg motion. We eliminate the impurities in each motion primitive, such as those caused by missed tracking or changing light conditions. The assembled vector of motion primitives is an improved representation of the action. We demonstrate our method on four different datasets.
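The bag-of-words step that turns a video into an LDA "document" can be sketched as below. This assumes a precomputed codebook and nearest-codeword assignment by squared Euclidean distance; the function names and the toy two-dimensional descriptors are hypothetical, and the LDA fitting and dominant code-word selection described above are not shown.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest_codeword(descriptor: Vector, codebook: Sequence[Vector]) -> int:
    """Index of the codeword closest to the descriptor (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((d - c) ** 2 for d, c in zip(descriptor, codebook[k])),
    )


def bag_of_words(descriptors: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Count how many trajectory descriptors fall on each codeword.

    The resulting histogram plays the role of a document's word counts
    when the video is passed to a topic model.
    """
    histogram = [0] * len(codebook)
    for descriptor in descriptors:
        histogram[nearest_codeword(descriptor, codebook)] += 1
    return histogram


if __name__ == "__main__":
    codebook = [(0.0, 0.0), (1.0, 1.0)]
    descriptors = [(0.1, 0.0), (0.9, 1.2), (0.2, 0.1)]
    print(bag_of_words(descriptors, codebook))
```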