In this paper, we introduce a novel framework for video-based action recognition, which incorporates the sequential information with the spatiotemporal features. Specifically, the spatiotemporal features are extracted...
详细信息
ISBN:
(纸本)9781509014378
In this paper, we introduce a novel framework for video-based action recognition, which incorporates the sequential information with the spatiotemporal features. Specifically, the spatiotemporal features are extracted from the sliced clips of videos, and then a recurrent neural network is applied to embed the sequential information into the final feature representation of the video. In contrast to most current deep learning methods for the video-based tasks, our framework incorporates both long-term dependencies and spatiotemporal information of the clips in the video. To extract the spatiotemporal features from the clips, both dense trajectories (DT) and a newly proposed 3D neural network, C3D, are applied in our experiments. Our proposed framework is evaluated on the benchmark datasets of UCF101 and HMDB51, and achieves comparable performance compared with the state-of-the-art results.
Scene classification is a major open challenge in machine vision. Most solutions proposed so far such as those based on color histograms and local texture statistics cannot capture a scene's global configuration, ...
详细信息
ISBN:
(纸本)0780342364
Scene classification is a major open challenge in machine vision. Most solutions proposed so far such as those based on color histograms and local texture statistics cannot capture a scene's global configuration, which is critical in perceptual judgments of scene similarity. We present a novel approach, ''configural recognition'', for encoding scene class structure. The approach's main feature is its use of qualitative spatial and photometric relationships within and across regions in low resolution images. The emphasis on qualitative measures leads to enhanced generalization abilities and the use of low-resolution images renders the scheme computationally efficient. We present results on a large database of natural scenes. We also describe how qualitative scene concepts may be learned from examples.
CoMoGAN is a continuous GAN relying on the unsupervised reorganization of the target data on a functional manifold. To that matter, we introduce a new Functional Instance Normalization layer and residual mechanism, wh...
详细信息
ISBN:
(纸本)9781665445092
CoMoGAN is a continuous GAN relying on the unsupervised reorganization of the target data on a functional manifold. To that matter, we introduce a new Functional Instance Normalization layer and residual mechanism, which together disentangle image content from position on target manifold. We rely on naive physics-inspired models to guide the training while allowing private model/translations features. CoMoGAN can be used with any GAN backbone and allows new types of image translation, such as cyclic image translation like timelapse generation, or detached linear translation. On all datasets, it outperforms the literature.
People enjoy food photography because they appreciate food. Behind each meal there is a story described in a complex recipe and, unfortunately, by simply looking at a food image we do not have access to its preparatio...
详细信息
ISBN:
(纸本)9781728132938
People enjoy food photography because they appreciate food. Behind each meal there is a story described in a complex recipe and, unfortunately, by simply looking at a food image we do not have access to its preparation process. Therefore, in this paper we introduce an inverse cooking system that recreates cooking recipes given food images. Our system predicts ingredients as sets by means of a novel architecture, modeling their dependencies without imposing any order, and then generates cooking instructions by attending to both image and its inferred ingredients simultaneously. We extensively evaluate the whole system on the large-scale Recipe1M dataset and show that (1) we improve performance w.r.t. previous baselines for ingredient prediction;(2) we are able to obtain high quality recipes by leveraging both image and ingredients;(3) our system is able to produce more compelling recipes than retrieval-based approaches according to human judgment. We make code and models publicly available(1).
Online dictionary learning is particularly useful for processing large-scale and dynamic data in computervision. It, however faces the major difficulty to incorporate robust functions, rather than the square data fit...
详细信息
ISBN:
(纸本)9780769549897
Online dictionary learning is particularly useful for processing large-scale and dynamic data in computervision. It, however faces the major difficulty to incorporate robust functions, rather than the square data fitting term, to handle outliers in training data. In this paper we propose a new online framework enabling the use of l(1) sparse data fitting term in robust dictionary learning, notably enhancing the usability and practicality of this important technique. Extensive experiments have been carried out to validate our new framework.
We present Meta Pseudo Labels, a semi-supervised learning method that achieves a new state-of-the-art top-1 accuracy of 90.2% on ImageNet, which is 1.6% better than the existing state-of-the-art [16]. Like Pseudo Labe...
详细信息
ISBN:
(纸本)9781665445092
We present Meta Pseudo Labels, a semi-supervised learning method that achieves a new state-of-the-art top-1 accuracy of 90.2% on ImageNet, which is 1.6% better than the existing state-of-the-art [16]. Like Pseudo Labels, Meta Pseudo Labels has a teacher network to generate pseudo labels on unlabeled data to teach a student network. However, unlike Pseudo Labels where the teacher is fixed, the teacher in Meta Pseudo Labels is constantly adapted by the feedback of the student's performance on the labeled dataset. As a result, the teacher generates better pseudo labels to teach the student.(1)
This work presents an occlusion aware hand tracker to reliably track both hands of a person using a monocular RGB camera. To demonstrate its robustness, we evaluate the tracker on a challenging, occlusion-ridden natur...
详细信息
ISBN:
(纸本)9781509014378
This work presents an occlusion aware hand tracker to reliably track both hands of a person using a monocular RGB camera. To demonstrate its robustness, we evaluate the tracker on a challenging, occlusion-ridden naturalistic driving dataset, where hand motions of a driver are to be captured reliably. The proposed framework additionally encodes and learns tracklets corresponding to complex (yet frequently occurring) hand interactions offline, and makes an informed choice during data association. This provides positional information of the left and right hands with no intrusion (through complete or partial occlusions) over long, unconstrained video sequences in an online manner. The tracks thus obtained may find use in domains such as human activity analysis, gesture recognition, and higher-level semantic categorization.
We introduce Spatio-Temporal Vector of Locally Max Pooled Features (ST-VLMPF), a super vector-based encoding method specifically designed for local deep features encoding. The proposed method addresses an important pr...
详细信息
ISBN:
(纸本)9781538604571
We introduce Spatio-Temporal Vector of Locally Max Pooled Features (ST-VLMPF), a super vector-based encoding method specifically designed for local deep features encoding. The proposed method addresses an important problem of video understanding: how to build a video representation that incorporates the CNN features over the entire video. Feature assignment is carried out at two levels, by using the similarity and spatio-temporal information. For each assignment we build a specific encoding, focused on the nature of deep features, with the goal to capture the highest feature responses from the highest neuron activation of the network. Our ST-VLMPF clearly provides a more reliable video representation than some of the most widely used and powerful encoding approaches (Improved Fisher Vectors and Vector of Locally Aggregated Descriptors), while maintaining a low computational complexity. We conduct experiments on three action recognition datasets: HMDB51, UCF50 and UCF101. Our pipeline obtains state-of-the-art results.
This paper introduces a new method for object recognition which is based on a recurrent neural network trained in a supervised mode. The RNN inputs 3-dimensional laser scanner data sequentially, in a natural temporal ...
详细信息
ISBN:
(纸本)9781424439942
This paper introduces a new method for object recognition which is based on a recurrent neural network trained in a supervised mode. The RNN inputs 3-dimensional laser scanner data sequentially, in a natural temporal order in which the laser returns arrive to the scanner The method is illustrated on a two-class problem with real data.
In this paper, we tackle the problem of performing inference in graphical models whose energy is a polynomial function of continuous variables. Our energy minimization method follows a dual decomposition approach, whe...
详细信息
ISBN:
(纸本)9780769549897
In this paper, we tackle the problem of performing inference in graphical models whose energy is a polynomial function of continuous variables. Our energy minimization method follows a dual decomposition approach, where the global problem is split into subproblems defined over the graph cliques. The optimal solution to these subproblems is obtained by making use of a polynomial system solver. Our algorithm inherits the convergence guarantees of dual decomposition. To speed up optimization, we also introduce a variant of this algorithm based on the augmented Lagrangian method. Our experiments illustrate the diversity of computervision problems that can be expressed with polynomial energies, and demonstrate the benefits of our approach over existing continuous inference methods.
暂无评论