ISBN (print): 9781665445092
This paper introduces the task of few-shot common action localization in time and space. Given a few trimmed support videos containing the same but unknown action, we strive for spatio-temporal localization of that action in a long untrimmed query video. We do not require any class labels, interval bounds, or bounding boxes. To address this challenging task, we introduce a novel few-shot transformer architecture with a dedicated encoder-decoder structure optimized for joint commonality learning and localization prediction, without the need for proposals. Experiments on reorganizations of the AVA and UCF101-24 datasets show the effectiveness of our approach for few-shot common action localization, even when the support videos are noisy. Although our approach is not specifically designed for common localization in time only, we also compare favorably against the few-shot and one-shot state-of-the-art in this setting. Lastly, we demonstrate that the few-shot transformer is easily extended to common action localization per pixel.
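The abstract does not come with code, but the core idea can be illustrated with a minimal sketch. The snippet below is a toy approximation, not the authors' architecture: a standard PyTorch transformer encoder-decoder conditioned on an averaged support prototype, emitting candidate boxes and commonality scores. All module names, dimensions, and heads are assumptions for illustration.

```python
# Minimal sketch (not the paper's exact model): a transformer encoder-decoder
# that consumes query-video features conditioned on a support prototype and
# predicts per-slot boxes plus a commonality score with the support set.
import torch
import torch.nn as nn

class FewShotCommonLocalizer(nn.Module):
    def __init__(self, feat_dim=256, num_queries=16):
        super().__init__()
        self.transformer = nn.Transformer(
            d_model=feat_dim, nhead=8,
            num_encoder_layers=4, num_decoder_layers=4,
            batch_first=True)
        self.query_embed = nn.Embedding(num_queries, feat_dim)
        self.box_head = nn.Linear(feat_dim, 4)    # (x, y, w, h) per slot
        self.score_head = nn.Linear(feat_dim, 1)  # commonality with the supports

    def forward(self, query_feats, support_feats):
        # query_feats:   (B, T_q, D) features of the untrimmed query video
        # support_feats: (B, K, T_s, D) features of K trimmed support videos
        prototype = support_feats.mean(dim=(1, 2))            # (B, D) support prototype
        conditioned = query_feats + prototype.unsqueeze(1)    # broadcast over time
        tgt = self.query_embed.weight.unsqueeze(0).expand(query_feats.size(0), -1, -1)
        decoded = self.transformer(conditioned, tgt)          # (B, num_queries, D)
        return self.box_head(decoded).sigmoid(), self.score_head(decoded)

# Example: 2 query videos of 64 frames, 3 support videos of 16 frames each.
model = FewShotCommonLocalizer()
boxes, scores = model(torch.randn(2, 64, 256), torch.randn(2, 3, 16, 256))
```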
ISBN (print): 9781665445092
Understanding the nutritional content of food from visual data is a challenging computer vision problem, with the potential to have a positive and widespread impact on public health. Studies in this area are limited by existing datasets in the field, which lack the diversity and labels required for training models with nutritional understanding capability. We introduce Nutrition5k, a novel dataset of 5k diverse, real-world food dishes with corresponding video streams, depth images, component weights, and high-accuracy nutritional content annotation. We demonstrate the potential of this dataset by training a computer vision algorithm capable of predicting the caloric and macronutrient values of a complex, real-world dish at an accuracy that outperforms professional nutritionists. Further, we present a baseline for incorporating depth sensor data to improve nutrition predictions. We release Nutrition5k in the hope that it will accelerate innovation in the space of nutritional understanding.
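As a rough illustration of the depth-augmented baseline mentioned in the abstract, the sketch below shows one plausible way to fuse an RGB image and a depth map into a single nutrition regressor. The backbone choice, the four-channel input, and the five-target ordering are assumptions for illustration, not the released Nutrition5k reference model.

```python
# Illustrative sketch only: a small RGB-D regressor for caloric and
# macronutrient values. Targets are assumed to be ordered as
# (calories, mass, fat, carbs, protein); this is not the paper's model.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class NutritionRegressor(nn.Module):
    def __init__(self, num_targets=5):
        super().__init__()
        backbone = resnet18(weights=None)
        # Accept 4 channels: RGB plus a depth map as the fourth channel.
        backbone.conv1 = nn.Conv2d(4, 64, kernel_size=7, stride=2, padding=3, bias=False)
        backbone.fc = nn.Linear(backbone.fc.in_features, num_targets)
        self.backbone = backbone

    def forward(self, rgbd):
        # rgbd: (B, 4, H, W), depth normalized to roughly [0, 1]
        return self.backbone(rgbd)

model = NutritionRegressor()
pred = model(torch.randn(8, 4, 224, 224))              # (8, 5) nutrition estimates
loss = nn.functional.l1_loss(pred, torch.rand(8, 5))   # MAE-style objective
```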
ISBN (print): 9781665448994
Brand logos are often rendered in different styles depending on context, such as an event promotion. For example, Warner Bros. uses different variants of its brand logo across movies for promotional and aesthetic appeal. In this paper, we propose an automated method to render brand logos in the coloring style of branding material such as movie posters. For this, we adopt a photo-realistic neural style transfer method using movie posters as the style source. We propose a color-based image segmentation and matching method to assign style segments to logo segments. Using these, we render the well-known Warner Bros. logo in the coloring style of 141 movie posters. We also present survey results in which 287 participants rate the machine-stylized logos for their representativeness and visual appeal.
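A hedged sketch of the color-based "segment and match" idea: cluster pixel colors in the logo and in the poster with k-means, pair clusters by luminance rank, and recolor each logo segment with its matched poster color. This is a simplification used only to convey the matching step; the paper feeds such matches into photo-realistic neural style transfer rather than recoloring directly, and the luminance-rank pairing rule is an assumption.

```python
# Simplified color-segment matching between a logo and a poster.
import numpy as np
from sklearn.cluster import KMeans

def color_segments(image, k=4):
    # image: (H, W, 3) uint8 array; returns per-pixel labels and cluster centers.
    pixels = image.reshape(-1, 3).astype(np.float32)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)
    return km.labels_.reshape(image.shape[:2]), km.cluster_centers_

def match_and_recolor(logo, poster, k=4):
    logo_labels, logo_centers = color_segments(logo, k)
    _, poster_centers = color_segments(poster, k)
    # Pair segments by luminance rank (dark logo segments get dark poster colors).
    lum = np.array([0.299, 0.587, 0.114])
    logo_order = np.argsort(logo_centers @ lum)
    poster_order = np.argsort(poster_centers @ lum)
    mapping = dict(zip(logo_order, poster_order))
    out = np.zeros_like(logo, dtype=np.float32)
    for logo_seg, poster_seg in mapping.items():
        out[logo_labels == logo_seg] = poster_centers[poster_seg]
    return out.astype(np.uint8)

# Usage: stylized = match_and_recolor(logo_rgb, poster_rgb)
```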
This paper presents an eye gaze tracking system aimed at facilitating computer access for physically challenged individuals, particularly those with amputations or quadriplegia. Acting as a mouse interface, the system...
Foot bending is a basic component of walking. Although bending is important for foot-ground interaction and ensures the speed and balance of the whole walking cycle, there is still no practical method for recording an...
ISBN (digital): 9798350365474
ISBN (print): 9798350365481
In this paper, we introduce T-DEED, a Temporal-Discriminability Enhancer Encoder-Decoder for Precise Event Spotting in sports videos. T-DEED addresses multiple challenges in the task, including the need for discriminability among frame representations, high output temporal resolution to maintain prediction precision, and the necessity to capture information at different temporal scales to handle events with varying dynamics. It tackles these challenges through its specifically designed architecture, featuring an encoder-decoder for leveraging multiple temporal scales and achieving high output temporal resolution, along with temporal modules designed to increase token discriminability. Leveraging these characteristics, T-DEED achieves SOTA performance on the FigureSkating and FineDiving datasets. Code is available at https://***/arturxe2/T-DEED.
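The sketch below illustrates the general encoder-decoder pattern the abstract describes: temporal downsampling to capture longer context, upsampling back to full resolution so every frame keeps its own prediction, and a per-frame event head. Layer sizes, the skip connection, and the class count are assumptions for illustration, not the released T-DEED code.

```python
# Rough sketch of a 1D temporal encoder-decoder for per-frame event spotting.
import torch
import torch.nn as nn

class TemporalEncoderDecoder(nn.Module):
    def __init__(self, feat_dim=256, num_events=9):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU())
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(feat_dim, feat_dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose1d(feat_dim, feat_dim, 4, stride=2, padding=1), nn.ReLU())
        self.head = nn.Conv1d(feat_dim, num_events + 1, 1)  # +1 for background

    def forward(self, frame_feats):
        # frame_feats: (B, T, D) per-frame features; T assumed divisible by 4.
        x = frame_feats.transpose(1, 2)              # (B, D, T)
        coarse = self.encoder(x)                     # (B, D, T/4) coarse temporal scale
        fine = self.decoder(coarse)                  # (B, D, T) restored resolution
        return self.head(fine + x).transpose(1, 2)   # skip connection, (B, T, classes)

logits = TemporalEncoderDecoder()(torch.randn(2, 64, 256))  # (2, 64, 10)
```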
Object tracking is a core task in the field of computer vision, with wide-ranging applications in scenarios such as video surveillance and autonomous driving. This paper focuses on the application of the SeqTrack meth...
ISBN (print): 9781665448994
Deep-learning-based generative models have proven capable of achieving excellent results in numerous image processing tasks with a wide range of applications. One significant advantage of deep-learning approaches over traditional approaches is their ability to regenerate semantically coherent images while relying only on an input with limited information. This advantage becomes even more crucial when the input size is only a very minor proportion of the output size. Such image expansion tasks can be more challenging because the missing area may originally contain many semantic features that are critical in judging the quality of an image. In this paper, we propose an edge-guided generative network model for producing semantically consistent output from a small image input. Our experiments show the proposed network is able to regenerate high-quality images even when some structural features are missing in the input.
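As a hedged illustration of the edge-guided idea, the sketch below extracts an edge map from the small input, places image and edges on a zero-padded canvas, and feeds the 4-channel result to a toy fully convolutional generator. The layer choices and the way guidance is injected are assumptions for illustration, not the paper's network.

```python
# Sketch: build an edge-guided input for image expansion and run a toy generator.
import cv2
import numpy as np
import torch
import torch.nn as nn

def build_generator_input(small_rgb, out_h, out_w):
    # small_rgb: (h, w, 3) uint8 crop assumed to sit at the center of the output.
    edges = cv2.Canny(cv2.cvtColor(small_rgb, cv2.COLOR_RGB2GRAY), 100, 200)
    canvas = np.zeros((out_h, out_w, 4), dtype=np.float32)
    h, w = small_rgb.shape[:2]
    top, left = (out_h - h) // 2, (out_w - w) // 2
    canvas[top:top + h, left:left + w, :3] = small_rgb / 255.0
    canvas[top:top + h, left:left + w, 3] = edges / 255.0
    return torch.from_numpy(canvas).permute(2, 0, 1).unsqueeze(0)  # (1, 4, H, W)

# A toy fully convolutional "generator" consuming the 4-channel guidance input.
generator = nn.Sequential(
    nn.Conv2d(4, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid())

x = build_generator_input(np.zeros((64, 64, 3), dtype=np.uint8), 256, 256)
expanded = generator(x)  # (1, 3, 256, 256) candidate expansion
```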
The effectiveness, speed, wealth of documentation, community support, integration possibilities, platform freedom, integrated face detection models, and ongoing development make OpenCV the tool of choice for face dete...
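For readers unfamiliar with the API, the snippet below shows the kind of OpenCV face detection referred to above, using one of the Haar cascade models bundled with the library. The image path is only a placeholder, and the detection parameters are common defaults rather than tuned values.

```python
# Minimal OpenCV face detection with the bundled Haar cascade model.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

image = cv2.imread("group_photo.jpg")          # placeholder path
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5,
                                 minSize=(30, 30))

for (x, y, w, h) in faces:                      # draw a box around each detection
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("faces_detected.jpg", image)
```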
ISBN (print): 9781665448994
Convolutional neural networks are able to learn realistic image priors from numerous training samples in low-level image generation and restoration [66]. We show that, for high-level image recognition tasks, we can further reconstruct "realistic" images of each category by leveraging intrinsic Batch Normalization (BN) statistics without any training data. Inspired by the popular VAE/GAN methods, we regard the zero-shot optimization process of synthetic images as generative modeling to match the distribution of BN statistics. The generated images serve as a calibration set for the following zero-shot network quantization. Our method meets the need to quantize models trained on sensitive information when, e.g., due to privacy concerns, no data is available. Extensive experiments on benchmark datasets show that, with the help of generated data, our approach consistently outperforms existing data-free quantization methods.
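A compressed sketch of the BN-statistics-matching idea described above: optimize random noise images so that the activation statistics they induce in a pretrained network match the stored BatchNorm running statistics. The choice of ResNet-18, the hyperparameters, and the plain L2 matching loss are assumptions for illustration, not the paper's exact recipe.

```python
# Synthesize calibration images by matching BatchNorm running statistics.
import torch
import torch.nn as nn
from torchvision.models import resnet18

model = resnet18(weights=None).eval()
bn_layers = [m for m in model.modules() if isinstance(m, nn.BatchNorm2d)]

captured = []
def hook(module, inputs, output):
    x = inputs[0]
    captured.append((x.mean(dim=(0, 2, 3)), x.var(dim=(0, 2, 3), unbiased=False)))
for bn in bn_layers:
    bn.register_forward_hook(hook)

images = torch.randn(16, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.Adam([images], lr=0.1)

for step in range(200):
    captured.clear()
    optimizer.zero_grad()
    model(images)
    # Assumes hook firing order matches modules() order, which holds for ResNet-18.
    loss = sum(
        torch.norm(mean - bn.running_mean) + torch.norm(var - bn.running_var)
        for (mean, var), bn in zip(captured, bn_layers))
    loss.backward()
    optimizer.step()
# `images` can now serve as a calibration set for post-training quantization.
```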