ISBN: 9781665448994 (Print)
Labeling videos at scale is impractical. Consequently, self-supervised visual representation learning is key for efficient video analysis. Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge. However, when applied to real-world videos, contrastive learning may unknowingly lead to separation of instances that contain semantically similar events. In our work, we introduce a cooperative variant of contrastive learning to utilize complementary information across views and address this issue. We use data-driven sampling to leverage implicit relationships between multiple input video views, whether observed (e.g. RGB) or inferred (e.g. flow, segmentation masks, poses). We are among the first to explore exploiting inter-instance relationships to drive learning. We experimentally evaluate our representations on the downstream task of action recognition. Our method achieves competitive performance on standard benchmarks (UCF101, HMDB51, Kinetics400). Furthermore, qualitative experiments illustrate that our models can capture higher-order class relationships. The code is available at http://***/nishantrai18/CoCon.
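As a concrete reference for the contrastive framework this abstract builds on, the sketch below computes a standard InfoNCE loss between embeddings of two views (e.g. RGB and flow) of the same clips. It is a minimal illustration assuming PyTorch tensors; the cooperative, inter-instance sampling that distinguishes CoCon is not reproduced here, and the function name multiview_infonce is ours.

```python
import torch
import torch.nn.functional as F

def multiview_infonce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    # z1, z2: (N, D) embeddings of two views (e.g. RGB and flow) of the same N clips
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau            # (N, N) cross-view cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    # the matching clip in the other view is the positive; all other clips are negatives
    return F.cross_entropy(logits, targets)
```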
ISBN: 9798350365474 (Digital), 9798350365481 (Print)
Foggy-scene semantic segmentation (FSSS) is highly challenging due to the diverse effects of fog on scene properties and the limited training data. Existing research has mainly focused on domain adaptation for FSSS, which has practical limitations when dealing with new scenes. In our paper, we introduce domain-generalized FSSS, which can work effectively on unknown distributions without extensive training. To address domain gaps, we propose a frequency decoupling (FreD) approach that separates fog-related effects (amplitude) from scene semantics (phase) in feature representations. Our method is compatible with both CNN and vision Transformer backbones and outperforms existing approaches in various scenarios.
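To make the frequency-decoupling idea concrete, the sketch below splits a feature map into its amplitude and phase spectra with a 2D FFT, following the abstract's association of amplitude with fog effects and phase with scene semantics. It assumes PyTorch feature tensors; the function names and the choice to operate directly on raw features are illustrative, not the paper's exact FreD module.

```python
import torch

def frequency_decouple(feat: torch.Tensor):
    # feat: (N, C, H, W) real-valued feature map
    spec = torch.fft.fft2(feat, norm="ortho")   # complex spectrum over the spatial dims
    amplitude = spec.abs()                      # fog-related appearance cues (per the abstract)
    phase = spec.angle()                        # scene semantics / structure
    return amplitude, phase

def recompose(amplitude: torch.Tensor, phase: torch.Tensor) -> torch.Tensor:
    # rebuild a real-valued feature map from (possibly processed) amplitude and phase
    spec = torch.polar(amplitude, phase)        # amplitude * exp(i * phase)
    return torch.fft.ifft2(spec, norm="ortho").real
```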
ISBN: 9781665448994 (Print)
One of the most fundamental and information-laden actions humans do is to look at objects. However, a survey of current works reveals that existing gaze-related datasets annotate only the pixel being looked at, and not the boundaries of a specific object of interest. This lack of object annotation presents an opportunity for further advancing gaze estimation research. To this end, we present a challenging new task called gaze object prediction, where the goal is to predict a bounding box for a person's gazed-at object. To train and evaluate gaze networks on this task, we present the Gaze On Objects (GOO) dataset. GOO is composed of a large set of synthetic images (GOO-Synth) supplemented by a smaller subset of real images (GOO-Real) of people looking at objects in a retail environment. Our work establishes extensive baselines on GOO by re-implementing and evaluating selected state-of-the-art models on the tasks of gaze following and domain adaptation. Code is available on GitHub.
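For a sense of the task's inputs and outputs, the snippet below implements a naive baseline for gaze object prediction: given a predicted gaze point and candidate object boxes, it returns the box whose center is closest to the gaze point. This only illustrates the task format under assumed tensor layouts; it is not a model from the paper.

```python
import torch

def gaze_to_object_box(gaze_xy: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
    # gaze_xy: (2,) predicted gaze point; boxes: (K, 4) candidates as (x1, y1, x2, y2)
    centers = torch.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                           (boxes[:, 1] + boxes[:, 3]) / 2], dim=1)   # (K, 2) box centers
    dists = torch.linalg.norm(centers - gaze_xy[None, :], dim=1)      # distance to the gaze point
    return boxes[dists.argmin()]                                      # closest candidate box
```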
This paper proposes a novel model for predicting body mass index and various body part sizes using front, side, and back body images. The model is trained on a large dataset of labeled images. The results show that th...
ISBN: 9781665448994 (Print)
Anticipating human actions is an important task that needs to be addressed for the development of reliable intelligent agents, such as self-driving cars or robot assistants. While the ability to make future predictions with high accuracy is crucial for designing anticipation approaches, the speed at which inference is performed is no less important. Methods that are accurate but not sufficiently fast introduce high latency into the decision process and thus increase the reaction time of the underlying system. This poses a problem for domains such as autonomous driving, where reaction time is crucial. In this work, we propose a simple and effective multi-modal architecture based on temporal convolutions. Our approach stacks a hierarchy of temporal convolutional layers and does not rely on recurrent layers, ensuring fast prediction. We further introduce a multi-modal fusion mechanism that captures the pairwise interactions between RGB, flow, and object modalities. Results on two large-scale datasets of egocentric videos, EPIC-Kitchens-55 and EPIC-Kitchens-100, show that our approach achieves performance comparable to state-of-the-art approaches while being significantly faster.
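The following sketch shows the kind of recurrence-free temporal model the abstract describes: a stack of dilated 1D convolutions over per-frame features. Layer widths, depth, and the dilation schedule are illustrative assumptions rather than the paper's exact architecture, and the multi-modal pairwise fusion is omitted.

```python
import torch
import torch.nn as nn

class TemporalConvStack(nn.Module):
    """A minimal stack of dilated temporal convolutions over per-frame features,
    illustrating a recurrence-free anticipation backbone (sizes are illustrative)."""

    def __init__(self, in_dim: int, hidden: int = 256, layers: int = 4):
        super().__init__()
        blocks, dim = [], in_dim
        for i in range(layers):
            dilation = 2 ** i  # exponentially growing temporal receptive field
            blocks += [
                nn.Conv1d(dim, hidden, kernel_size=3, padding=dilation, dilation=dilation),
                nn.ReLU(inplace=True),
            ]
            dim = hidden
        self.net = nn.Sequential(*blocks)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, feature_dim, time), e.g. per-frame RGB/flow/object features
        return self.net(x)  # (batch, hidden, time), fed to an anticipation head

# usage: out = TemporalConvStack(1024)(torch.randn(2, 1024, 32))
```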
In recent years, Multi-Camera Multiple Object Tracking (MCMT) has gained significant attention as a crucial computer vision application. Research focuses on data association and track detection. However, accurately se...
ISBN: 9781665448994 (Print)
In this paper, we explore the role of Instance Normalization in low-level vision tasks. Specifically, we present a novel block, the Half Instance Normalization Block (HIN Block), to boost the performance of image restoration networks. Based on the HIN Block, we design a simple and powerful multi-stage network named HINet, which consists of two subnetworks. With the help of the HIN Block, HINet surpasses the state-of-the-art (SOTA) on various image restoration tasks. For image denoising, we exceed the SOTA by 0.11 dB and 0.28 dB in PSNR on the SIDD dataset, with only 7.5% and 30% of its multiplier-accumulator operations (MACs) and 6.8× and 2.9× speedups, respectively. For image deblurring, we achieve comparable performance with 22.5% of its MACs and a 3.3× speedup on the REDS and GoPro datasets. For image deraining, we exceed it by 0.3 dB in PSNR on the average result over multiple datasets, with a 1.4× speedup. With HINet, we won 1st place in the NTIRE 2021 Image Deblurring Challenge, Track 2: JPEG Artifacts, with a PSNR of 29.70.
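A minimal sketch of the Half Instance Normalization idea follows: instance-normalize only half of a block's channels, keep the other half untouched, concatenate, and add a residual path. The exact layer layout here is an assumption based on the description above, not a line-by-line copy of the official HINet block.

```python
import torch
import torch.nn as nn

class HalfInstanceNormBlock(nn.Module):
    """Sketch of a Half Instance Normalization block (out_ch assumed even)."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.norm = nn.InstanceNorm2d(out_ch // 2, affine=True)
        self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)
        self.act = nn.LeakyReLU(0.2, inplace=True)
        self.skip = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.conv1(x)
        a, b = torch.chunk(y, 2, dim=1)           # split channels in half
        y = torch.cat([self.norm(a), b], dim=1)   # instance-normalize only one half
        y = self.act(y)
        y = self.act(self.conv2(y))
        return y + self.skip(x)                   # residual connection
```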
ISBN: 9781665448994 (Print)
Understanding broadcast videos is a challenging task in computer vision, as it requires generic reasoning capabilities to appreciate the content offered by the video editing. In this work, we propose SoccerNet-v2, a novel large-scale corpus of manual annotations for the SoccerNet [24] video dataset, along with open challenges to encourage more research in soccer understanding and broadcast production. Specifically, we release around 300k annotations within SoccerNet's 500 untrimmed broadcast soccer videos. We extend current tasks in the realm of soccer to include action spotting and camera shot segmentation with boundary detection, and we define a novel replay grounding task. For each task, we provide and discuss benchmark results, reproducible with our open-source adapted implementations of the most relevant works in the field. SoccerNet-v2 is presented to the broader research community to help push computer vision closer to automatic solutions for more general video understanding and production purposes.
ISBN: 9781665448994 (Print)
We present SrvfNet, a generative deep learning framework for the joint multiple alignment of large collections of functional data comprising square-root velocity functions (SRVF) to their templates. Our proposed framework is fully unsupervised and is capable of aligning to a predefined template as well as jointly predicting an optimal template from data while simultaneously achieving alignment. Our network is constructed as a generative encoder-decoder architecture comprising fully-connected layers capable of producing a distribution space of the warping functions. We demonstrate the strength of our framework by validating it on synthetic data as well as diffusion profiles from magnetic resonance imaging (MRI) data.
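For readers unfamiliar with the representation, the snippet below computes the standard square-root velocity function q(t) = f'(t) / sqrt(|f'(t)|) for a sampled 1-D function and applies a warping function by re-parameterization. The finite-difference sampling and epsilon handling are illustrative choices, not part of SrvfNet itself.

```python
import numpy as np

def srvf(f: np.ndarray, t: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Square-root velocity function of a sampled function f(t):
    q(t) = f'(t) / sqrt(|f'(t)|), with eps guarding against division by zero."""
    df = np.gradient(f, t)                  # finite-difference derivative
    return df / np.sqrt(np.abs(df) + eps)   # square-root velocity

def warp(f: np.ndarray, t: np.ndarray, gamma: np.ndarray) -> np.ndarray:
    """Re-parameterize f by a warping function gamma sampled on t: f(gamma(t))."""
    return np.interp(gamma, t, f)
```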
ISBN: 9781665448994 (Print)
To help meet the increasing need for dynamic vision sensor (DVS) event camera data, this paper proposes the v2e toolbox, which generates realistic synthetic DVS events from intensity frames. It also clarifies incorrect claims about DVS motion blur and latency characteristics in recent literature. Unlike other toolboxes, v2e includes pixel-level Gaussian event threshold mismatch, finite intensity-dependent bandwidth, and intensity-dependent noise. Realistic DVS events are useful in training networks for uncontrolled lighting conditions. The use of v2e synthetic events is demonstrated in two experiments. The first experiment is object recognition with the N-Caltech 101 dataset. Results show that pretraining on various v2e lighting conditions improves generalization when transferring to real DVS data for a ResNet model. The second experiment shows that for night driving, a car detector trained with v2e events achieves an average accuracy improvement of 40% compared to a YOLOv3 detector trained on intensity frames.
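The sketch below shows the idealized core of intensity-frame-to-event conversion: emit an ON or OFF event whenever the log intensity at a pixel moves past a contrast threshold since its last event. v2e layers per-pixel threshold mismatch, finite intensity-dependent bandwidth, and noise on top of this; those effects, as well as the variable and function names here, are simplifications and assumptions.

```python
import numpy as np

def frames_to_events(frames, timestamps, theta_on=0.2, theta_off=0.2):
    """Idealized DVS event generation from a sequence of grayscale frames.
    Emits at most one (t, x, y, polarity) event per pixel per frame; real sensors
    (and v2e) additionally model threshold mismatch, bandwidth, and noise."""
    log_mem = np.log(frames[0].astype(np.float64) + 1e-3)  # memorized log intensity per pixel
    events = []
    for frame, t in zip(frames[1:], timestamps[1:]):
        diff = np.log(frame.astype(np.float64) + 1e-3) - log_mem
        on = diff >= theta_on
        off = diff <= -theta_off
        for y, x in zip(*np.nonzero(on | off)):
            events.append((t, int(x), int(y), 1 if on[y, x] else -1))
        # move the memorized intensity by one threshold step where events fired
        log_mem[on] += theta_on
        log_mem[off] -= theta_off
    return events
```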