The past few years have seen improvements in the frame rate, accuracy, cost, and size of 3D sensors, including systems based on stereo vision, time-of-flight, structured light, and depth-from-X methods. These exciting...
The past few years have seen improvements in the frame rate, accuracy, cost, and size of 3D sensors, including systems based on stereo vision, time-of-flight, structured light, and depth-from-X methods. These exciting developments have sparked increasing interest in industry and academia in the challenges and opportunities afforded by real-time 3D sensing. Applications such as tracking, recognition, and scene understanding may become more robust and more feasible thanks to these sensors.
We propose to formulate multi-target tracking as minimization of a continuous energy function. Other than a number of recent approaches we focus on designing an energy function that represents the problem as faithfull...
详细信息
We address the problem of labeling individual datapoints given some knowledge about (small) subsets or groups of them. The knowledge we have for a group is the likelihood value for each group member to satisfy a certa...
详细信息
In this paper, we develop a MultiTask Learning (MTL) model to achieve dense predictions for comics panels to, in turn, facilitate the transfer of comics from one publication channel to another by assisting authors in ...
In this paper, we develop a MultiTask Learning (MTL) model to achieve dense predictions for comics panels to, in turn, facilitate the transfer of comics from one publication channel to another by assisting authors in the task of reconfiguring their narratives. Our MTL method can successfully identify the semantic units as well as the embedded notion of 3D in comics panels. This is a significantly challenging problem because comics comprise disparate artistic styles, illustrations, layouts, and object scales that depend on the author's creative process. Typically, dense image-based prediction techniques require a large corpus of data. Finding an automated solution for dense prediction in the comics domain, therefore, becomes more difficult with the lack of ground-truth dense annotations for the comics images. To address these challenges, we develop the following solutions- we leverage a commonly-used strategy known as unsupervised image-to-image translation, which allows us to utilize a large corpus of real-world annotations; - we utilize the results of the translations to develop our multitasking approach that is based on a vision transformer backbone and a domain transferable attention module; -we study the feasibility of integrating our MTL dense-prediction method with an existing retargeting method, thereby reconfiguring comics.
In this paper we describe a new dataset, under construction, acquired inside the National Museum of Bargello in Florence. It was recorded with three IP cameras at a resolution of 1280 × 800 pixels and an average ...
详细信息
In this paper we describe a new dataset, under construction, acquired inside the National Museum of Bargello in Florence. It was recorded with three IP cameras at a resolution of 1280 × 800 pixels and an average framerate of five frames per second. Sequences were recorded following two scenarios. The first scenario consists of visitors watching different artworks (individuals), while the second one consists of groups of visitors watching the same artworks (groups). This dataset is specifically designed to support research on group detection, occlusion handling, tracking, re-identification and behavior analysis. In order to ease the annotation process we designed a user friendly web interface that allows to annotate: bounding boxes, occlusion area, body orientation and head gaze, group belonging, and artwork under observation. We provide a comparison with other existing datasets that have group and occlusion annotations. In order to assess the difficulties of this dataset we have also performed some tests exploiting seven representative state-of-the-art pedestrian detectors.
This paper presents a comparative evaluation of feature embeddings for classification and ranking in large-scale Internet image datasets. We follow a popular framework for scalable visual learning, in which the data i...
详细信息
MLP-based architectures, which consist of a sequence of consecutive multi-layer perceptron blocks, have recently been found to reach comparable results to convolutional and transformer-based methods on image classific...
MLP-based architectures, which consist of a sequence of consecutive multi-layer perceptron blocks, have recently been found to reach comparable results to convolutional and transformer-based methods on image classification. However, most methods adopt spatial MLPs which take fixed-dimension inputs, therefore making it difficult to apply them as backbones to downstream tasks such as object detection and semantic segmentation, which require inputs with arbitrary dimension. Moreover, single-stage designs further limit the performance in other computervision tasks and fully-connected layers bear heavy computation. To tackle these problems, we propose ConvMLP: a Hierarchical Convolutional MLP for visual recognition, which is a lightweight, stage-wise, co-design of convolution layers, and MLPs. In particular, ConvMLP-S achieves 76.8% top-1 accuracy on ImageNet-1k with 9M parameters and 2.4 GMACs (15% and 19% of MLP-Mixer-B/16, respectively). Experiments on object detection and semantic segmentation further show that visual representation learned by ConvMLP can be seamlessly transferred to downstream tasks and achieve competitive results with fewer parameters. Our code and pre-trained models are open-sourced at https://***/SHI-Labs/Convolutional-MLPs.
In this paper, we propose an ordinal hyperplane ranking algorithm called OHRank, which estimates human ages via facial images. The design of the algorithm is based on the relative order information among the age label...
详细信息
Conventional non-blind image deblurring algorithms involve natural image priors and maximum a-posteriori (MAP) estimation. As a consequence of MAP estimation, separate pre-processing steps such as noise estimation and...
详细信息
In this paper, we introduce the Equipment Nameplate Dataset, a large dataset for scene text detection and recognition. Natural images in this dataset are taken in the wild and thus this dataset includes various intra-...
详细信息
暂无评论