Object detection has reached strong performance over the last decade, and its use has spread to application areas such as medicine, transportation, and sports. However, one of the more underuti...
ISBN: 9781665448994 (print)
Existing computer vision research on artwork struggles with fine-grained attribute recognition and with the lack of curated, annotated datasets, which are costly to create. In this work, we use CLIP (Contrastive Language-Image Pre-Training) [12] to train a neural network on a variety of art image and text pairs, so that it can learn directly from raw descriptions of images or, where available, from curated labels. The model's zero-shot capability allows it to predict the most relevant natural language description for a given image without being directly optimized for the task. Our approach aims to solve two challenges: instance retrieval and fine-grained artwork attribute recognition. We use the iMet Dataset [20], which we consider the largest annotated artwork dataset. Our code and models will be available at https://***/KeremTurgutlu/clip_art
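The zero-shot prediction step described in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the random vectors below stand in for the outputs of CLIP's image and text encoders, the function names are hypothetical, and the logit scale of 100 is an assumed value. The sketch only shows how candidate descriptions are ranked by scaled cosine similarity to an image embedding.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so that dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Rank candidate descriptions for one image.

    image_emb:   (D,) embedding of the image (assumed to come from an image encoder).
    text_embs:   (K, D) embeddings of K candidate descriptions (assumed text encoder output).
    logit_scale: assumed temperature-like factor applied before the softmax.
    """
    image_emb = l2_normalize(np.asarray(image_emb, dtype=np.float64))
    text_embs = l2_normalize(np.asarray(text_embs, dtype=np.float64))
    logits = logit_scale * (text_embs @ image_emb)  # (K,) scaled cosine similarities
    logits = logits - logits.max()                  # numerical stability for exp
    probs = np.exp(logits)
    return probs / probs.sum()


# Toy stand-ins for encoder outputs: three candidate descriptions, 8-dim embeddings.
rng = np.random.default_rng(0)
text_embs = rng.normal(size=(3, 8))
image_emb = text_embs[1] + 0.05 * rng.normal(size=8)  # image lies close to description 1

probs = zero_shot_probs(image_emb, text_embs)
best = int(np.argmax(probs))
print(best, probs.round(3))
```

With real encoders, the same ranking could serve both tasks named in the abstract: attribute recognition (candidate texts are attribute prompts) and instance retrieval (candidates are embeddings of other items).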
The new era of technology is being greatly influenced by the field of artificial intelligence. Computer vision and deep learning have become increasingly important due to their ability to process vast amounts of data ...
Hand gestures serve as a fundamental mode of non-verbal communication, intricately conveying messages through the positioning of palms, arrangement of fingers, and the form of the hand itself. Their significance lies...
The development of underwater robotic systems with autonomous grasping capabilities is challenging due to the complexity of the operation environment, limited sensing performance, and computation load. This paper prop...
With the rapid development of industrial automation, automated sorting of precision industrial components has become key to improving production efficiency and reducing costs. In this paper, a machine vision-based sor...
Recent transformer-based methods achieve notable gains in the Human-Object Interaction Detection (HOID) task by leveraging the detection capability of DETR and the prior knowledge of Vision-Language Models (VLMs). However, these m...
ISBN: 9781665445092 (print)
We address the problem of computing a textural loss based on the statistics extracted from the feature activations of a convolutional neural network optimized for object recognition (e.g. VGG-19). The underlying mathematical problem is the measure of the distance between two distributions in feature space. The Gram-matrix loss is the ubiquitous approximation for this problem but it is subject to several shortcomings. Our goal is to promote the Sliced Wasserstein Distance as a replacement for it. It is theoretically proven, practical, simple to implement, and achieves results that are visually superior for texture synthesis by optimization or training generative neural networks.
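The contrast between the two losses can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation: the function names, the number of random directions, and the toy features (standing in for VGG-19 activations flattened to N samples by C channels) are all assumptions. The sliced Wasserstein distance projects both feature sets onto random unit directions, sorts each projection, and averages the squared differences. The Gram-matrix loss compares only second-order statistics.

```python
import numpy as np


def gram_loss(feats_a, feats_b):
    """Mean squared difference between the Gram matrices of two (N, C) feature sets."""
    a = np.asarray(feats_a, dtype=np.float64)
    b = np.asarray(feats_b, dtype=np.float64)
    gram_a = a.T @ a / a.shape[0]
    gram_b = b.T @ b / b.shape[0]
    return float(np.mean((gram_a - gram_b) ** 2))


def sliced_wasserstein_sq(feats_a, feats_b, n_directions=64, rng=None):
    """Monte Carlo estimate of the squared sliced 2-Wasserstein distance.

    Assumes both feature sets have the same shape (N, C), so that the sorted
    1D projections can be compared element by element.
    """
    rng = np.random.default_rng() if rng is None else rng
    a = np.asarray(feats_a, dtype=np.float64)
    b = np.asarray(feats_b, dtype=np.float64)
    if a.shape != b.shape:
        raise ValueError("this sketch assumes feature sets of identical shape")
    # Random unit directions in feature space, one per column.
    dirs = rng.normal(size=(a.shape[1], n_directions))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)
    # Sorting each 1D projection solves the 1D optimal transport problem.
    proj_a = np.sort(a @ dirs, axis=0)
    proj_b = np.sort(b @ dirs, axis=0)
    return float(np.mean((proj_a - proj_b) ** 2))


# Toy features: non-negative activations versus their sign-flipped copy.
rng = np.random.default_rng(0)
feats = np.abs(rng.normal(size=(256, 16)))
flipped = -feats

gram_gap = gram_loss(feats, flipped)
swd_gap = sliced_wasserstein_sq(feats, flipped, rng=np.random.default_rng(1))
swd_same = sliced_wasserstein_sq(feats, feats[::-1], rng=np.random.default_rng(1))
print(gram_gap, swd_gap, swd_same)
```

In this toy case the two feature sets have identical Gram matrices even though their distributions differ, while the sliced distance separates them and stays at zero for a mere reordering of the samples. The example is chosen for illustration and is not necessarily one of the shortcomings the authors have in mind.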
ISBN: 9781665458245 (print)
Deep learning has brought tremendous progress in computer vision and natural language processing, and is used in multiple non-critical applications. A major bottleneck for its use in many other areas is the black-box nature of these algorithms, which results in a lack of explainability in their decisions. One of the key problems identified is the confounding effect, which causes confusion between the desired causes and other irrelevant factors affecting an outcome. This is more pronounced in the spatio-temporal case, such as the bias toward the static background in video classification. One way to handle this is to use sensors that capture additional scene properties in order to mitigate spurious associations. In this work, we integrate polarimetric videos with deep learning and evaluate the combination on the popular action recognition problem. We construct a dataset of polarimetric videos for fine-grained actions and study the effect of various parameters, extracted from the polarimetric video frames, as inputs to a deep network. Using these observations, we design a spatio-temporal polarization network (STP-Net) to effectively extract polarimetric features. This is evaluated on the recent HumanAct12 dataset for human activity recognition. Extensive evaluation clearly shows that the polarimetric modality is able to localize the correct action regions, leading to better generalizability.
In computer vision, datasets and benchmarks are widely used to compare algorithms and boost scientific progress. Especially in the human action recognition research field, extracting dance poses from video sequences f...