Training models to apply linguistic knowledge and visual concepts from 2D images to 3D world understanding is a promising direction that researchers have only recently started to explore. In this work, we design a novel 3D vision-language pre-training method that helps a model learn semantically meaningful and transferable 3D scene point cloud representations. We inject the representational power of the popular CLIP model into our 3D encoder by aligning the encoded 3D scene features with the corresponding 2D image and text embeddings produced by CLIP. To assess our model's 3D world reasoning capability, we evaluate it on the downstream task of 3D Visual Question Answering. Quantitative and qualitative experimental results show that our pre-training method outperforms state-of-the-art methods on this task and leads to an interpretable representation of 3D scene features.
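The abstract does not spell out the alignment objective, so the sketch below is only a minimal illustration of how 3D scene features could be pulled toward frozen CLIP embeddings; the function and tensor names (`clip_alignment_loss`, `scene_feats`, and so on) are hypothetical, and a contrastive (InfoNCE) loss over the batch would be an equally plausible choice.

```python
import torch
import torch.nn.functional as F

def clip_alignment_loss(scene_feats, clip_image_embs, clip_text_embs):
    """Pull encoded 3D scene features toward the CLIP embeddings of the
    corresponding 2D image and text description.

    scene_feats:     (B, D) output of the 3D point-cloud encoder
    clip_image_embs: (B, D) frozen CLIP image embeddings
    clip_text_embs:  (B, D) frozen CLIP text embeddings
    """
    # Work on the unit hypersphere, as CLIP itself does.
    z  = F.normalize(scene_feats, dim=-1)
    zi = F.normalize(clip_image_embs, dim=-1)
    zt = F.normalize(clip_text_embs, dim=-1)

    # Simple cosine-alignment objective against both modalities.
    loss_img = (1.0 - (z * zi).sum(dim=-1)).mean()
    loss_txt = (1.0 - (z * zt).sum(dim=-1)).mean()
    return loss_img + loss_txt
```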
The common practice in sign language recognition is to first construct individual sign models, in terms of discrete state transitions, mostly represented using Hidden Markov Models, from manually isolated sign samples, and then to use them to recognize signs in continuous sentences. In this paper we (i) propose a continuous state space model, where the states are based on purely image-based features, without the use of special gloves, and (ii) present an unsupervised approach to both extract and learn models for continuous basic units of signs, which we term signemes, from continuous sentences. Given a set of sentences with a common sign, we can automatically learn the model for the part of the sign, or signeme, that is least affected by coarticulation effects. While coarticulation effects exist in speech recognition, they are even stronger in sign language. The model itself is in terms of traces in a space of Relational Distributions. Each point in this space represents a Relational Distribution, capturing the spatial relationships between low-level features, such as edge points. We perform speed normalization and then incrementally extract the common sign between sentences, or signeme, with a dynamic programming framework at the core to compute the warped distance between two subsentences. We test our idea using the publicly available Boston SignStream Dataset by building signeme models of 18 signs. We assess the quality of the models by considering how well we can localize the sign in a new sentence. We also present preliminary results for the ability to generalize across signers.
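The dynamic programming step for comparing subsentences can be pictured as a dynamic-time-warping recurrence; the sketch below is a generic DTW distance under that assumption, not the paper's exact formulation, and the per-frame Relational-Distribution features are abstracted as plain vectors.

```python
import numpy as np

def warped_distance(a, b):
    """Dynamic-time-warping distance between two feature sequences.

    a: (m, d) array, one feature vector per frame
    b: (n, d) array
    A generic DTW sketch, not the paper's exact recurrence.
    """
    m, n = len(a), len(b)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])  # frame-to-frame cost
            # Allow a match, an insertion, or a deletion.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[m, n]
```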
The simultaneous use of images from different spectra can be helpful to improve the performance of many computer vision tasks. The core idea behind cross-spectral approaches is to take advantage of the strengths of each spectral band, providing a richer representation of a scene that cannot be obtained with images from one spectral band alone. In this work we tackle the cross-spectral image similarity problem by using Convolutional Neural Networks (CNNs). We explore three different CNN architectures to compare the similarity of cross-spectral image patches. Specifically, we train each network with images from the visible and the near-infrared spectrum, and then test the result on two public cross-spectral datasets. Experimental results show that CNN approaches outperform the current state of the art on both cross-spectral datasets. Additionally, our experiments show that some CNN architectures are capable of generalizing between different cross-spectral domains.
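As a rough illustration of how such patch-similarity networks are typically set up, here is a minimal siamese sketch in PyTorch with a contrastive loss; the layer sizes and the loss are assumptions for illustration, not taken from the paper, which compares three architectures of its own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchBranch(nn.Module):
    """Small CNN that embeds a 64x64 grayscale patch; an illustrative
    stand-in, not one of the paper's architectures."""
    def __init__(self, dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(128, dim)

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))

def contrastive_loss(e_vis, e_nir, same, margin=1.0):
    """same = 1 for corresponding VIS/NIR patches, 0 otherwise."""
    d = F.pairwise_distance(e_vis, e_nir)
    return (same * d.pow(2) + (1 - same) * F.relu(margin - d).pow(2)).mean()
```

In the siamese configuration the same `PatchBranch` (shared weights) is applied to both the visible and the near-infrared patch before the loss is computed.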
ISBN (print): 9781467388511
During the past few years we have witnessed the development of many methodologies for building and fitting Statistical Deformable Models (SDMs). The construction of accurate SDMs requires careful annotation of images with respect to a consistent set of landmarks. However, the manual annotation of a large number of images is a tedious, laborious and expensive procedure. Furthermore, for several deformable objects, e.g. the human body, it is difficult to define a consistent set of landmarks, and thus it becomes impossible to train humans to accurately annotate a collection of images. Nevertheless, for the majority of objects, it is possible to extract the shape by object segmentation or even by shape drawing. In this paper, we show for the first time, to the best of our knowledge, that it is possible to construct SDMs by putting object shapes in dense correspondence. Such SDMs can be built with much less effort for a large battery of objects. Additionally, we show that, by sampling the dense model, a part-based SDM can be learned with its parts being in correspondence. We employ our framework to develop SDMs of human arms and legs, which can be used for the segmentation of the outline of the human body, as well as to provide better and more consistent annotations for body joints.
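Once shapes are in dense correspondence, building the statistical model itself reduces to PCA over the stacked shape vectors. The sketch below shows that standard point-distribution-model step, assuming the correspondence (and any alignment such as Procrustes) has already been established; all names are illustrative.

```python
import numpy as np

def build_shape_model(shapes, n_modes=5):
    """Point-distribution model from shapes already in dense
    correspondence. shapes: (N, K, 2), N shapes of K 2-D points."""
    X = shapes.reshape(len(shapes), -1)            # flatten to (N, 2K)
    mean = X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    modes = Vt[:n_modes]                           # principal deformation modes
    stdev = S[:n_modes] / np.sqrt(len(shapes))     # per-mode std deviations
    return mean, modes, stdev

def synthesize(mean, modes, stdev, b):
    """Generate a new shape from mode coefficients b (in std units)."""
    return (mean + (b * stdev) @ modes).reshape(-1, 2)
```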
To make Web information system development more efficient and to improve the reusability, stability and scalability of the resulting systems, the features of Web information system development are studied. A Web information s...
One of the more important limitations of current tools for performing arts production and design is that collaboration between designers is hard to achieve. In fact, designers must currently be co-located to collaborate...
Information obtained from calibrated cameras by means of computer vision is integrated with location events from an ultrasonic tracking system deployed in an indoor office. This results in improved estimates of state and location, which are used to augment the environmental model maintained by a sentient computing system. Fusion of the different sources of information takes place at a high level, using Bayesian networks to model the dependencies and reliabilities of the multi-modal variables. Context is represented using a world model of both the static and dynamic environment. The world model serves both as an ontology of prior information for multi-modal integration and as a source of context for applications.
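The paper's Bayesian network models dependencies and reliabilities across many variables; as a much simpler stand-in, the sketch below fuses two independent Gaussian position estimates (one per modality) by precision weighting, which captures the basic idea that the more reliable sensor dominates the fused estimate.

```python
import numpy as np

def fuse_position(mu_vision, var_vision, mu_ultra, var_ultra):
    """Fuse two independent Gaussian position estimates, e.g. one from
    the camera system and one from the ultrasonic tracker, using the
    standard Gaussian-product rule. Variances encode sensor reliability."""
    w_v = 1.0 / var_vision              # precision of the vision estimate
    w_u = 1.0 / var_ultra               # precision of the ultrasonic estimate
    mu = (w_v * mu_vision + w_u * mu_ultra) / (w_v + w_u)
    var = 1.0 / (w_v + w_u)             # fused estimate is more certain
    return mu, var
```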
Robust and reliable detection of overtaking vehicles is an important component of any on-board driver assistance system. Optical flow, with the abundant motion information present in image sequences, has been studied extensively for vehicle detection. However, using dense optical flow for vehicle detection is sensitive to shocks and vibrations of the mobile camera and to image outliers caused by illumination changes, and it incurs high computational complexity. To improve vehicle detection performance and reduce computational complexity, we propose an efficient and robust methodology for overtaking vehicle detection based on homogeneous sparse optical flow and eigenspace modeling. Specifically, our method models the background as dynamic and quasi-static regions. Instead of using dense optical flow to model the dynamic parts of the background, we employ homogeneous sparse optical flow, which makes detection more robust to camera shocks and vibrations. Moreover, to make detection robust to illumination changes, we employ a block-based eigenspace approach to represent quasi-static regions in the background. A region-based hysteresis-thresholding approach, augmented by a localized spatial segmentation procedure, attains a good tradeoff between true detections and false positives. The proposed methodology has been evaluated on challenging traffic scenes, demonstrating good performance.
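A hedged sketch of the two ingredients named in the abstract: sparse optical flow (here, standard OpenCV corner tracking with pyramidal Lucas-Kanade as a stand-in for the paper's homogeneous sparse flow) and a block-wise eigenspace test whose reconstruction error flags non-background blocks. The feature selection and parameter values are assumptions, not the paper's.

```python
import cv2
import numpy as np

def sparse_flow(prev_gray, curr_gray, max_corners=200):
    """Track sparse corner features between frames with pyramidal
    Lucas-Kanade; a generic stand-in for homogeneous sparse flow."""
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=max_corners,
                                  qualityLevel=0.01, minDistance=7)
    nxt, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts, None)
    good = status.ravel() == 1
    return pts[good].reshape(-1, 2), nxt[good].reshape(-1, 2)

def eigen_background_score(block, mean, basis):
    """Reconstruction error of an image block under a PCA (eigenspace)
    background model; a large error suggests a non-background (vehicle)
    block. `mean` is the flattened mean block, `basis` holds the top
    eigenvectors as rows."""
    v = block.ravel().astype(np.float32) - mean
    recon = basis.T @ (basis @ v)   # project onto the eigenspace and back
    return float(np.linalg.norm(v - recon))
```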
Egocentric vision provides a unique perspective of the visual world that is inherently human-centric. Since egocentric cameras are mounted on the user (typically on the user's head), they are naturally primed to gather visual information from our everyday interactions, and can even act on that information in real time (e.g., for a vision aid). We believe that this human-centric characteristic of egocentric vision can have a large impact on the way we approach central computer vision tasks such as visual detection, recognition, prediction, and socio-behavioral analysis. By taking advantage of the first-person point-of-view paradigm, there have been recent advances in areas such as personalized video summarization, understanding concepts of social saliency, activity analysis with inside-out cameras (a camera to capture eye gaze and an outward-looking camera), recognizing human interactions, and modeling focus of attention. However, in many ways people are only beginning to understand the full potential (and limitations) of the first-person paradigm. In the 3rd Workshop on Egocentric (First-Person) Vision, we bring together researchers to discuss emerging topics such as: personalization of visual analysis; socio-behavioral modeling; understanding group dynamics and interactions; egocentric video as big data; first-person vision for robotics; and Egographical User Interfaces (EUIs).