Let's imagine you could choose the position of the camera for a particular face analysis task: where would you put it? In this work, we provide a first analysis based on synthetic training data, providing evidence that this choice is not trivial: it depends on the training data and differs across tasks. We provide results for two major face analysis tasks, face recognition and landmark detection. For our experiments, we use a 3D Morphable Model, as it gives us full control over pose, illumination, and identity to generate idealized training data. Whilst rendered images are not photorealistic, they let us avoid confounding factors and biases from other sources (e.g. pose bias in the training data). Our results show that the optimal camera poses are near frontal, but not exactly frontal, and depend on the task. By comparing the results obtained with pose-specific training sets to those from a uniform training distribution without pose bias, we show that the accuracy of both tasks not only depends on the bias in the training data but is actually dominated by the pose-dependent difficulty of the task itself.
Many works have tried to solve the catastrophic forgetting (CF) problem in continual learning (lifelong learning). However, pursuing non-forgetting on old tasks may damage the model's plasticity for new tasks. Although some methods have been proposed to achieve a stability-plasticity trade-off, none has considered evaluating a model's plasticity and improving it adaptively for a new task. In this work, we propose a new method, called adaptive plasticity improvement (API), for continual learning. Besides overcoming CF on old tasks, API also evaluates the model's plasticity and then adaptively improves it for learning a new task if necessary. Experiments on several real datasets show that API outperforms other state-of-the-art baselines in terms of both accuracy and memory usage.
Most neural networks for computer vision are designed to operate on RGB images. However, these RGB images are commonly encoded as JPEG before being saved to disk; decoding them imposes an unavoidable overhead for RGB networks. Instead, our work focuses on training vision Transformers (ViTs) directly on the encoded features of JPEG. This way, we can avoid most of the decoding overhead and accelerate data loading. Existing works have studied this aspect, but they focus on CNNs; because of how these encoded features are structured, CNNs require heavy architectural modifications to accept such data. Here, we show that this is not the case for ViTs. In addition, we tackle data augmentation directly on these encoded features, which, to our knowledge, has not been explored in depth for training in this setting. With these two improvements (ViT and data augmentation), we show that our ViT-Ti model achieves up to 39.2% faster training and 17.9% faster inference with no accuracy loss compared to its RGB counterpart. Code available at https://***/JeongsooP/RGB-no-more
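To make the idea concrete, below is a minimal sketch of how per-block JPEG DCT coefficients could be embedded as ViT tokens. The tensor layout and the 2x2 grouping of 8x8 luma blocks (a 16x16-pixel footprint, matching a standard ViT patch) are illustrative assumptions, not the paper's exact embedding design.

```python
# Hypothetical sketch: embed 8x8 JPEG DCT blocks as transformer tokens.
import torch
import torch.nn as nn

class DCTPatchEmbed(nn.Module):
    """Group 2x2 neighbouring luma DCT blocks into one token via a linear projection."""
    def __init__(self, embed_dim=192):  # 192 = ViT-Ti width
        super().__init__()
        # 4 blocks x 64 DCT coefficients per token
        self.proj = nn.Linear(4 * 64, embed_dim)

    def forward(self, dct):              # dct: (B, H/8, W/8, 64)
        B, h, w, _ = dct.shape
        dct = dct.reshape(B, h // 2, 2, w // 2, 2, 64)
        dct = dct.permute(0, 1, 3, 2, 4, 5).reshape(B, (h // 2) * (w // 2), 4 * 64)
        return self.proj(dct)            # (B, num_tokens, embed_dim)

# Dummy DCT coefficients for a 224x224 luma channel (28x28 blocks of 8x8).
tokens = DCTPatchEmbed()(torch.randn(1, 28, 28, 64))
print(tokens.shape)  # torch.Size([1, 196, 192]) -- same token count as RGB ViT-Ti
```

Note that the resulting 196 tokens match what a 224x224 RGB ViT with 16x16 patches would produce, so the downstream Transformer needs no structural change; only the patch embedding differs.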
Automatic video production of sports aims at producing an aesthetic broadcast of sporting events. We present a new video system able to automatically produce a smooth and pleasant broadcast of basketball games using a single fixed 4K camera. The system automatically detects and localizes players, the ball, and referees to recognize the main action coordinates and game states, yielding a professional, cameraman-like production of the basketball event. We also release a fully annotated dataset consisting of single-4K-camera and twelve-camera videos of basketball games.
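The core idea of such a "virtual cameraman" can be sketched as cropping a smoothly moving HD window out of the fixed 4K frame, following the detected action point. The smoothing constant and crop size below are illustrative assumptions, not the paper's values.

```python
# Hedged sketch: pan a virtual HD camera across a fixed 4K frame.
import numpy as np

FRAME_W, FRAME_H = 3840, 2160     # fixed 4K camera
CROP_W, CROP_H = 1920, 1080       # produced broadcast window
ALPHA = 0.1                       # smoothing factor (assumed)

def virtual_camera(action_points):
    """Yield (x0, y0) crop origins for a sequence of detected action centres."""
    cx, cy = FRAME_W / 2, FRAME_H / 2          # start centred on the court
    for ax, ay in action_points:
        cx += ALPHA * (ax - cx)                # exponential smoothing keeps
        cy += ALPHA * (ay - cy)                # the pan motion pleasant to watch
        x0 = int(np.clip(cx - CROP_W / 2, 0, FRAME_W - CROP_W))
        y0 = int(np.clip(cy - CROP_H / 2, 0, FRAME_H - CROP_H))
        yield x0, y0

# Example: action drifting from mid-court toward the right basket.
points = [(1900 + 15 * t, 1100) for t in range(5)]
print(list(virtual_camera(points)))
```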
Stochastic filtering is widely used to deal with nonlinear optimization problems such as 3-D and visual tracking in various computer vision and augmented reality applications. Many current methods suffer from an imbalance between exploration and exploitation due to particle degeneracy and impoverishment, resulting in convergence to local optima. To address this imbalance, this work proposes a new constrained evolutionary diffusion filter for nonlinear optimization. Specifically, the filter develops spatial state constraints and an adaptive history-recall differential evolution embedded in evolutionary stochastic diffusion, in place of sequential resampling, to resolve the degeneracy and impoverishment problem. Applied to monocular endoscope 3-D tracking, the experimental results show that the proposed filter significantly improves the balance between exploration and exploitation and clearly outperforms recent 3-D tracking methods; in particular, the surgical tracking error was reduced from 4.03 mm to 2.59 mm.
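As an illustration of the general idea, the sketch below applies a differential-evolution (DE) update to a particle set in place of sequential resampling. This is a generic DE/rand/1/bin step, not the paper's exact "history-recall" variant or its spatial constraints.

```python
# Illustrative sketch: DE mutation/crossover as particle diffusion.
import numpy as np

rng = np.random.default_rng(0)

def de_diffuse(particles, fitness, F=0.5, CR=0.9):
    """One DE generation over particles (N, D); keep mutants that improve fitness."""
    N, D = particles.shape
    scores = fitness(particles)
    for i in range(N):
        a, b, c = rng.choice([j for j in range(N) if j != i], 3, replace=False)
        mutant = particles[a] + F * (particles[b] - particles[c])
        cross = rng.random(D) < CR
        trial = np.where(cross, mutant, particles[i])
        if fitness(trial[None])[0] > scores[i]:   # greedy selection preserves diversity
            particles[i] = trial
    return particles

# Toy fitness: proximity to a "true" 3-D pose at the origin.
fit = lambda p: -np.linalg.norm(p, axis=1)
particles = rng.normal(0, 2.0, size=(32, 3))
particles = de_diffuse(particles, fit)
print(particles.std(axis=0))  # spread shrinks toward the optimum without resampling
```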
The aim of this paper is to propose a mechanism to efficiently and explicitly model image hierarchies at the global, regional, and local ranges for image restoration. To achieve that, we start by analyzing two important properties of natural images: cross-scale similarity and anisotropic image features. Inspired by this, we propose anchored stripe self-attention, which strikes a good balance between the space and time complexity of self-attention and modelling capacity beyond the regional range. We then propose a new network architecture, dubbed GRL, that explicitly models image hierarchies at the Global, Regional, and Local ranges via anchored stripe self-attention, window self-attention, and channel-attention-enhanced convolution. Finally, the proposed network is applied to 7 image restoration tasks, covering both real and synthetic settings, and sets a new state of the art on several of them. Code will be available at https://***/ofsoundof/***.
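The efficiency argument behind anchored attention can be illustrated with a simplified factorization: tokens attend to a small anchor set instead of to all other tokens, cutting the quadratic cost. The anchor construction below (strided subsampling) is an assumption for illustration and not GRL's exact stripe formulation.

```python
# Simplified sketch of anchored self-attention via a low-rank factorization.
import torch
import torch.nn.functional as F

def anchored_attention(q, k, v, stride=4):
    # q, k, v: (B, N, C). Anchors are a strided subsample of the keys.
    a = k[:, ::stride]                                    # (B, N/stride, C)
    scale = q.shape[-1] ** 0.5
    attn_qa = F.softmax(q @ a.transpose(1, 2) / scale, dim=-1)  # queries -> anchors
    attn_ak = F.softmax(a @ k.transpose(1, 2) / scale, dim=-1)  # anchors -> keys
    return attn_qa @ (attn_ak @ v)                        # cost O(N * N/stride), not O(N^2)

x = torch.randn(1, 256, 64)
print(anchored_attention(x, x, x).shape)  # torch.Size([1, 256, 64])
```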
Human-object interaction (HOI) detection is a core task in computer vision. The goal is to localize all human-object pairs and recognize their interactions. An interaction defined by a tuple leads to a long-tailed visual recognition challenge, since many combinations are rarely represented. The performance of existing models is limited, especially for the tail categories, but little has been done to understand why. To that end, in this paper we propose to diagnose rarity in HOI detection. We propose a three-step strategy, namely Detection, Identification, and Recognition, in which we carefully analyse the limiting factors by studying state-of-the-art models. Our findings indicate that the detection and identification steps are affected by interaction signals such as occlusion and relative location, which in turn limit recognition accuracy.
The Segment Anything Model (SAM) is a deep neural network foundation model designed to perform instance segmentation, which has gained significant popularity given its zero-shot segmentation ability. SAM operates by generating masks based on various input prompts such as text, bounding boxes, points, or masks, introducing a novel methodology to overcome the constraints posed by dataset-specific scarcity. While SAM is trained on an extensive dataset comprising more than 11M images, it mostly consists of natural photographic (visible band) images, with only very limited images from other modalities. Although rapid progress in visual infrared surveillance and X-ray security screening imaging technologies, driven forward by advances in deep learning, has significantly enhanced the ability to detect, classify, and segment objects with high accuracy, it is not evident whether the SAM zero-shot capabilities transfer to such modalities beyond the visible spectrum. For this reason, this work comprehensively assesses SAM capabilities in segmenting objects of interest in the X-ray and infrared imaging modalities. Our approach reuses and preserves the pre-trained SAM with three different prompts: bounding box, centroid, and random points. We present several quantitative and qualitative results to showcase the performance of SAM on selected datasets. Our results show that SAM can segment objects in the X-ray modality when given a box prompt, but its performance varies for point prompts. Specifically, SAM performs poorly in segmenting slender objects and organic materials, such as plastic bottles. Additionally, we find that infrared objects are also challenging to segment with point prompts given the low-contrast nature of this modality. Overall, this study shows that while SAM demonstrates outstanding zero-shot capabilities with box prompts, its performance ranges from moderate to poor for point prompts, indicating that special consideration of its cross-modal generalisation is required when applying SAM beyond the visible spectrum.
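For reference, the kind of box- and point-prompted zero-shot evaluation described here can be run with the official segment_anything package as sketched below. The checkpoint path, the placeholder image, and the prompt coordinates are assumptions for illustration.

```python
# Usage sketch of zero-shot SAM with box and centroid-point prompts.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # assumed local file
predictor = SamPredictor(sam)

image = np.zeros((480, 640, 3), dtype=np.uint8)   # stand-in for an X-ray/IR frame
predictor.set_image(image)

# Box prompt (x1, y1, x2, y2) around an object of interest.
masks, scores, _ = predictor.predict(box=np.array([100, 80, 300, 260]),
                                     multimask_output=False)

# Centroid point prompt: one foreground click at the object centre.
masks, scores, _ = predictor.predict(point_coords=np.array([[200, 170]]),
                                     point_labels=np.array([1]),
                                     multimask_output=True)
```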
This paper introduces Content-aware Token Sharing (CTS), a token reduction approach that improves the computational efficiency of semantic segmentation networks that use vision Transformers (ViTs). Existing works have proposed token reduction approaches to improve the efficiency of ViT-based image classification networks, but these methods are not directly applicable to semantic segmentation, which we address in this work. We observe that, for semantic segmentation, multiple image patches can share a token if they contain the same semantic class, as they carry redundant information. Our approach leverages this by employing an efficient, class-agnostic policy network that predicts whether image patches contain the same semantic class and lets them share a token if they do. Through experiments, we explore the critical design choices of CTS and show its effectiveness on the ADE20K, Pascal Context, and Cityscapes datasets, for various ViT backbones and different segmentation decoders. With Content-aware Token Sharing, we are able to reduce the number of processed tokens by up to 44% without diminishing segmentation quality.
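A conceptual sketch of the sharing step is shown below: a policy decides, per 2x2 group of patches, whether the group can collapse into one token. The stub policy and the mean-pooled shared token are assumptions for illustration; CTS uses a trained class-agnostic policy network.

```python
# Conceptual sketch of content-aware token sharing.
import torch

def share_tokens(patch_tokens, share_mask):
    """patch_tokens: (G, 4, C) groups of 2x2 patch embeddings.
    share_mask: (G,) bool, True if the 4 patches hold one semantic class."""
    kept = []
    for g in range(patch_tokens.shape[0]):
        if share_mask[g]:
            kept.append(patch_tokens[g].mean(0, keepdim=True))  # one shared token
        else:
            kept.append(patch_tokens[g])                        # keep all 4 tokens
    return torch.cat(kept, dim=0)

tokens = torch.randn(4, 4, 192)                 # 16 patches in 4 groups
mask = torch.tensor([True, False, True, True])  # 3 groups judged class-uniform
print(share_tokens(tokens, mask).shape)         # torch.Size([7, 192]): 16 -> 7 tokens
```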
We propose Mask Auto-Labeler (MAL), a high-quality, Transformer-based mask auto-labeling framework for instance segmentation using only box annotations. MAL takes box-cropped images as inputs and conditionally generates their mask pseudo-labels. We show that vision Transformers are good mask auto-labelers. Our method significantly reduces the gap between auto-labeling and human annotation in terms of mask quality. Instance segmentation models trained on MAL-generated masks can nearly match the performance of their fully supervised counterparts, retaining up to 97.4% of fully supervised performance. Our best model achieves 44.1% mAP on COCO instance segmentation (test-dev 2017), outperforming state-of-the-art box-supervised methods by significant margins. Qualitative results indicate that masks produced by MAL are, in some cases, even better than human annotations.
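The auto-labeling loop itself is simple to picture: crop each ground-truth box, predict a mask on the crop, and paste it back as a pseudo-label. In the sketch below, mask_net is a hypothetical stand-in for the MAL Transformer, not its actual interface.

```python
# Minimal sketch of box-supervised mask auto-labeling.
import numpy as np

def auto_label(image, boxes, mask_net):
    """image: (H, W, 3); boxes: list of (x1, y1, x2, y2) ints."""
    H, W = image.shape[:2]
    pseudo_masks = np.zeros((len(boxes), H, W), dtype=bool)
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        crop = image[y1:y2, x1:x2]
        m = mask_net(crop)                      # (y2-y1, x2-x1) boolean mask
        pseudo_masks[i, y1:y2, x1:x2] = m       # paste back into full-image frame
    return pseudo_masks

# Stub "network": everything inside the box is foreground.
stub = lambda crop: np.ones(crop.shape[:2], dtype=bool)
masks = auto_label(np.zeros((64, 64, 3)), [(8, 8, 32, 40)], stub)
print(masks.sum())  # 768 = 24 * 32 pixels labeled foreground
```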