ISBN (print): 9781509021758
The bag of visual words (BOW) model is widely used for image representation and classification. Spatial pyramid based feature pooling builds on the BOW model and is the most popular approach to capturing the spatial distribution (layout) of local image features. It assumes that the center of an object is aligned with the center of the image, which can lead to misalignment and performance degradation. In this paper, we propose a method that utilizes max-pooled features to estimate object centers and align the spatial pyramid accordingly. We also propose an image representation descriptor that is robust to misalignment and object deformation. The experimental results demonstrate that our spatial pyramid alignment method is simple yet effective in handling misalignment and achieves high object classification accuracy.
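A minimal sketch of the alignment idea, assuming the object center is estimated as the median location of the strongest-response (max-pooled) local feature per codeword; the paper's exact estimation rule is not given in the abstract, and all names and values below are illustrative.

```python
# Center-aligned spatial pyramid pooling (sketch). The object center is assumed
# to be the median location of the max-pooled feature for each codeword.
import numpy as np

def estimate_center(locations, codes):
    """locations: (N, 2) feature coordinates in [0, 1]^2,
       codes: (N, K) coding responses (e.g. soft-assignment or LLC codes)."""
    argmax_per_word = codes.argmax(axis=0)           # max-pooled feature index per codeword
    return np.median(locations[argmax_per_word], axis=0)

def aligned_pyramid_pool(locations, codes, center, levels=(1, 2, 4)):
    """Max-pool codes in a spatial pyramid whose grid is shifted so that its
       center coincides with the estimated object center."""
    shifted = np.clip(locations - center + 0.5, 0.0, 1.0 - 1e-9)
    pooled = []
    for l in levels:
        cell = (shifted * l).astype(int)             # cell index per feature at this level
        for cy in range(l):
            for cx in range(l):
                mask = (cell[:, 0] == cy) & (cell[:, 1] == cx)
                pooled.append(codes[mask].max(axis=0) if mask.any()
                              else np.zeros(codes.shape[1]))
    return np.concatenate(pooled)

# Usage with random data
rng = np.random.default_rng(0)
locs, cds = rng.random((500, 2)), rng.random((500, 64))
rep = aligned_pyramid_pool(locs, cds, estimate_center(locs, cds))
print(rep.shape)   # (1 + 4 + 16) cells x 64 codewords
```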
ISBN (print): 9781538604625
For real-time computer vision tasks, binary feature descriptors are an efficient alternative to their real-valued counterparts. While providing comparable results for many applications, the computational complexity of extracting and processing binary descriptors is significantly lower. In many application scenarios, the local features are transmitted over a channel with limited capacity and processed at a more powerful central processing unit, which requires efficient compression and transmission approaches. In this paper, we present a compression scheme for local binary features, which jointly encodes the descriptors and their respective Bag-of-Words representation using a shared vocabulary between client and server. By sending the visual word index and the entropy-coded residual vector containing the differences between the visual word and the descriptor, we are able to reduce ORB features to 60.62 % of their uncompressed size.
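A minimal sketch of the descriptor/residual encoding idea, assuming a shared binary vocabulary on client and server: each 256-bit ORB descriptor is mapped to its Hamming-nearest visual word, and only the word index plus the XOR residual (sparse in set bits, hence cheap to entropy-code) would be transmitted. The vocabulary, bit length, and entropy estimate are illustrative, not the paper's exact scheme.

```python
import numpy as np

BITS = 256                                             # ORB descriptor length in bits

def encode(descriptor, vocabulary):
    """descriptor: (BITS,) 0/1 array; vocabulary: (K, BITS) 0/1 array."""
    dists = (descriptor ^ vocabulary).sum(axis=1)      # Hamming distance to each word
    idx = int(dists.argmin())
    residual = descriptor ^ vocabulary[idx]            # sparse bit-difference vector
    return idx, residual

def residual_entropy_bits(residual):
    """Empirical per-descriptor cost of the residual under an i.i.d. bit model."""
    p = residual.mean()
    if p in (0.0, 1.0):
        return 0.0
    h = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))   # bits per residual bit
    return h * residual.size

rng = np.random.default_rng(0)
vocab = rng.integers(0, 2, size=(1024, BITS), dtype=np.uint8)
desc = (vocab[7] ^ (rng.random(BITS) < 0.1)).astype(np.uint8)   # descriptor near word 7
idx, res = encode(desc, vocab)
print(idx, res.sum(), round(residual_entropy_bits(res) + np.log2(len(vocab)), 1))
```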
Bag of visual words is a popular model in human action recognition, but it usually suffers from the loss of spatial and temporal configuration information of local features and from large quantization error in its feature coding procedure. In this paper, to overcome these two deficiencies, we combine sparse coding with a spatio-temporal pyramid for human action recognition and regard this method as the baseline. More importantly, and this is the focus of this paper, we find that there is a hierarchical structure in the feature vector constructed by the baseline method. To exploit this hierarchical structure information for better recognition accuracy, we propose a tree regularized classifier that conveys the hierarchical structure information. The main contributions of this paper can be summarized as follows. First, we introduce a tree regularized classifier to encode the hierarchical structure information in the feature vector for human action recognition. Second, we present an optimization algorithm to learn the parameters of the proposed classifier. Third, the performance of the proposed classifier is evaluated on the YouTube, Hollywood2, and UCF50 datasets; the experimental results show that the proposed tree regularized classifier outperforms SVM and other popular classifiers and achieves promising results on the three datasets.
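A minimal sketch of a tree-structured penalty over a classifier weight vector whose blocks follow the spatio-temporal pyramid layout. The abstract does not spell out the exact regularizer, so a group-lasso-style penalty (sum over tree nodes of the l2 norm of that node's dimensions) is assumed here purely for illustration; the group layout and weights are hypothetical.

```python
import numpy as np

def pyramid_groups(n_words, cells_per_level=(1, 4, 8)):
    """Build tree-node groups: each node covers the dimensions of one pyramid
       cell; the root node covers all dimensions."""
    groups, start = [np.arange(n_words * sum(cells_per_level))], 0
    for n_cells in cells_per_level:
        for _ in range(n_cells):
            groups.append(np.arange(start, start + n_words))
            start += n_words
    return groups

def tree_penalty(w, groups, node_weight=1.0):
    """Sum of l2 norms of w restricted to each tree node's index group."""
    return node_weight * sum(np.linalg.norm(w[g]) for g in groups)

w = np.random.default_rng(0).standard_normal(4000 * (1 + 4 + 8))
print(round(tree_penalty(w, pyramid_groups(4000)), 2))
```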
Local methods based on spatio-temporal interest points (STIPs) have shown their effectiveness for human action recognition. The bag-of-words (BoW) model has been widely used and has dominated this field. Recently, a large number of techniques based on local features, including improved variants of the BoW model, sparse coding (SC), Fisher kernels (FK), the vector of locally aggregated descriptors (VLAD), and the naive Bayes nearest neighbor (NBNN) classifier, have been proposed and developed for visual recognition. However, some of them were proposed in the image domain and have not yet been applied to the video domain, and it is still unclear how effectively these techniques would perform on action recognition. In this paper, we provide a comprehensive study of these local methods for human action recognition. We implement these techniques and compare them under unified experimental settings on three widely used benchmarks, i.e., the KTH, UCF-YouTube, and HMDB51 datasets. We discuss the findings from the experimental results in depth and draw useful conclusions, which are expected to guide practical applications and future work in the action recognition community. (C) 2016 Elsevier B.V. All rights reserved.
This paper presents a feature encoding scheme for image classification that combines the salient coding method with category-specific codebooks, which are generated separately from the training images of each category. Different from the usual way of concatenating or merging the category codebooks into a global dictionary, we employ the category codebooks to calculate a type of category-sensitive saliency feature, and then encode the saliency features to form a representation of the image content. Compared to state-of-the-art methods such as LC-KSVD, the dictionary generation and feature encoding in our scheme are quite simple, and no complicated optimization is involved. Nevertheless, our scheme achieves better, and in some cases significantly better, classification accuracy than the state-of-the-art methods. Extensive experiments are carried out to show the effectiveness of our method in comparison with various image classification methods. (C) 2015 Elsevier B.V. All rights reserved.
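A minimal sketch of category-sensitive salient coding, assuming the standard salient-coding response (one minus the ratio of the distance to the nearest word over the mean distance to the next few words) computed separately against each category's codebook. The exact saliency definition and the subsequent encoding stage of the paper are not given in the abstract; codebook sizes, k, and the average pooling are illustrative.

```python
import numpy as np

def salient_response(x, codebook, k=5):
    """Saliency of local feature x w.r.t. one category codebook."""
    d = np.linalg.norm(codebook - x, axis=1)
    nearest = np.argsort(d)[:k]
    return max(0.0, 1.0 - d[nearest[0]] / d[nearest[1:]].mean())

def category_saliency_descriptor(features, category_codebooks, k=5):
    """(N, C) matrix of saliency scores, average-pooled into a C-dim image cue."""
    scores = np.array([[salient_response(x, cb, k) for cb in category_codebooks]
                       for x in features])
    return scores.mean(axis=0)

rng = np.random.default_rng(0)
codebooks = [rng.standard_normal((256, 128)) for _ in range(10)]   # 10 categories
feats = rng.standard_normal((200, 128))                            # dense SIFT-like features
print(category_saliency_descriptor(feats, codebooks).shape)        # (10,)
```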
Feature coding, which encodes local features extracted from an image with a codebook and generates a set of codes for efficient image representation, has shown very promising results in image classification. Vector quantization is the simplest and most widely used method for feature coding. However, it suffers from large quantization errors and leads to dissimilar codes for similar features. To alleviate these problems, we propose Laplacian Regularized Locality-constrained Coding (LapLLC), wherein a locality constraint is used to favor nearby bases for encoding, and Laplacian regularization is integrated to preserve the code consistency of similar features. By incorporating a set of template features, the objective function used by LapLLC can be decomposed, and each feature is encoded by solving a linear system. Additionally, the k-nearest-neighbor technique is employed to construct a much smaller linear system, so that fast approximate coding can be achieved. Therefore, LapLLC provides a novel way to perform efficient feature coding. Our experiments on a variety of image classification tasks demonstrate the effectiveness of the proposed approach. (C) 2015 Elsevier B.V. All rights reserved.
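A minimal sketch of the locality-constrained part of the coding step: each feature is encoded over its k nearest codewords by solving a small linear system (the standard LLC approximation). The Laplacian term of LapLLC, which couples the codes of similar features through the template features, is omitted here for brevity; parameter names and values are illustrative.

```python
import numpy as np

def knn_locality_code(x, codebook, k=5, lam=1e-4):
    """Return a sparse code for x supported on its k nearest codewords."""
    d = np.linalg.norm(codebook - x, axis=1)
    nn = np.argsort(d)[:k]
    B = codebook[nn] - x                       # shifted local bases, (k, D)
    C = B @ B.T + lam * np.eye(k)              # regularized local covariance
    w = np.linalg.solve(C, np.ones(k))         # solve the small linear system
    w /= w.sum()                               # enforce the sum-to-one constraint
    code = np.zeros(len(codebook))
    code[nn] = w
    return code

rng = np.random.default_rng(0)
codebook = rng.standard_normal((1024, 128))
x = rng.standard_normal(128)
c = knn_locality_code(x, codebook)
print(np.count_nonzero(c), round(c.sum(), 6))   # 5 nonzeros, sums to 1
```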
Land-use classification of very high spatial resolution remote sensing (VHSR) imagery is one of the most challenging tasks in the field of remote sensing image processing. However, land-use classification is hard to address with land-cover classification techniques, due to the complexity of land-use scenes. Scene classification is considered one of the most promising ways to address the land-use classification issue. The commonly used scene classification methods for VHSR imagery are all derived from the computer vision community and mainly deal with terrestrial image recognition. Differing from terrestrial images, VHSR images are taken by looking down with airborne and spaceborne sensors, which leads to distinct lighting conditions and a distinct spatial configuration of land cover in VHSR imagery. Considering these characteristics, two questions should be answered: (1) Which type or combination of information is suitable for VHSR imagery scene classification? (2) Which scene classification algorithm is best for VHSR imagery? In this paper, an efficient spectral-structural bag-of-features scene classifier (SSBFC) is proposed to combine the spectral and structural information of VHSR imagery. SSBFC utilizes first- and second-order statistics (the mean and standard deviation values, MeanStd) as the statistical spectral descriptor for the spectral information of the VHSR imagery, and uses dense scale-invariant feature transform (SIFT) as the structural feature descriptor. The experimental results show that the spectral information works better than the structural information, while the combination of spectral and structural information is better than either single type of information. Taking the characteristics of the spatial configuration into consideration, SSBFC uses the whole image scene as the scope of the pooling operator, instead of the scope generated by a spatial pyramid (SP) commonly used in terrestrial image classification. The experimental
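A minimal sketch of the MeanStd spectral descriptor: per-band mean and standard deviation computed over dense patches of a multispectral scene. The dense SIFT branch and the codebook/pooling stages of SSBFC are not shown; the patch size and stride below are illustrative choices.

```python
import numpy as np

def meanstd_descriptors(image, patch=16, stride=8):
    """image: (H, W, B) array. Returns (N, 2*B) descriptors, one per patch."""
    H, W, B = image.shape
    descs = []
    for y in range(0, H - patch + 1, stride):
        for x in range(0, W - patch + 1, stride):
            p = image[y:y + patch, x:x + patch].reshape(-1, B)
            descs.append(np.concatenate([p.mean(axis=0), p.std(axis=0)]))
    return np.array(descs)

scene = np.random.default_rng(0).random((256, 256, 4))   # 4-band VHSR-like tile
d = meanstd_descriptors(scene)
print(d.shape)   # (31 * 31, 8)
```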
ISBN (print): 9781479999880
Visual analysis algorithms have mostly been developed for a centralized scenario where all visual data is acquired and processed at a central location. However, in visual sensor networks (VSN), constraints on computational power, energy, and bandwidth require a radically different approach, notably a paradigm shift from centralized to distributed visual processing. In the new paradigm, visual data is acquired and features are extracted at the sensing node locations and then transmitted to enable further analysis at some central location. In such a scenario, one of the key challenges is to design suitable feature coding schemes that are able to exploit the correlation among the features corresponding to (partially) overlapping views of the same visual scene. To achieve efficient coding, it is proposed to employ the distributed source coding paradigm, as it does not require any communication between the sensing nodes (rather expensive in VSN) and is parsimonious in terms of computational resources. Experimental results show that significant accuracy and compression gains (up to 37.36%) can be achieved when coding features extracted from multiple views.
ISBN (print): 9781467372589
Recently, the latest advances in compact feature representation and feature learning have provided an efficient framework for several visual analysis tasks, such as object recognition. However, when multiple cameras with overlapping fields of view are employed, other visual analysis tasks such as depth estimation can be supported and object recognition accuracy can be improved. In this paper, the problem of distributed visual analysis from multiple views of a scene is addressed, considering that computational power and bandwidth at each camera sensor are rather limited. More specifically, an efficient coding technique for local binary features is proposed which exploits the correlation, at the decoder side, between each descriptor and its quantized representation. Moreover, considering that descriptors representing the same visual feature across different views are well correlated, a technique to avoid the transmission of redundant descriptors from multiple views is proposed. At the decoder, the joint statistics of all descriptors from all views are used to drive the selection of the best descriptors to be transmitted by each sensing node. The proposed multi-view feature coding and selection techniques allow bitrate reductions of up to 80%, with respect to the uncompressed descriptor rate, for a given task accuracy.
Learning efficient image representations is at the core of the scene classification task for remote sensing imagery. The existing methods for solving the scene classification task, based either on feature coding approaches with low-level hand-engineered features or on unsupervised feature learning, can only generate mid-level image features with limited representative ability, which essentially prevents them from achieving better performance. Recently, deep convolutional neural networks (CNNs), which are hierarchical architectures trained on large-scale datasets, have shown astounding performance in object recognition and detection. However, it is still not clear how to use these deep convolutional neural networks for high-resolution remote sensing (HRRS) scene classification. In this paper, we investigate how to transfer features from these successfully pre-trained CNNs for HRRS scene classification. We propose two scenarios for generating image features via extracting CNN features from different layers. In the first scenario, the activation vectors extracted from fully-connected layers are regarded as the final image features; in the second scenario, we extract dense features from the last convolutional layer at multiple scales and then encode the dense features into global image features through commonly used feature coding approaches. Extensive experiments on two public scene classification datasets demonstrate that the image features obtained by the two proposed scenarios, even with a simple linear classifier, can result in remarkable performance and improve the state-of-the-art by a significant margin. The results reveal that the features from pre-trained CNNs generalize well to HRRS datasets and are more expressive than the low- and mid-level features. Moreover, we tentatively combine features extracted from different CNN models for better performance.
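A minimal sketch of the two feature-extraction scenarios, assuming a recent torchvision with pretrained ImageNet weights. Scenario 1 takes the 4096-d activation of the penultimate fully-connected layer as the image feature; scenario 2 takes dense 512-d activations from the last convolutional layer, which would then be fed to a feature-coding stage (BoW/VLAD/FK, not shown). The choice of VGG-16 is illustrative, not necessarily the paper's model.

```python
import torch
from torchvision.models import vgg16, VGG16_Weights

model = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).eval()
img = torch.rand(1, 3, 224, 224)                    # stand-in for a preprocessed HRRS tile

with torch.no_grad():
    conv = model.features(img)                      # (1, 512, 7, 7) last-conv activations
    pooled = torch.flatten(model.avgpool(conv), 1)  # (1, 25088)
    fc_feat = model.classifier[:-1](pooled)         # scenario 1: (1, 4096) global feature
    dense = conv.flatten(2).squeeze(0).T            # scenario 2: (49, 512) dense descriptors

print(fc_feat.shape, dense.shape)
```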