ISBN (Print): 9781665445092
We have witnessed rapid progress on 3D-aware image synthesis, leveraging recent advances in generative visual models and neural rendering. Existing approaches, however, fall short in two ways: first, they may lack an underlying 3D representation or rely on view-inconsistent rendering, hence synthesizing images that are not multi-view consistent; second, they often depend upon representation network architectures that are not expressive enough, and their results thus lack image quality. We propose a novel generative model, named Periodic Implicit Generative Adversarial Networks (π-GAN or pi-GAN), for high-quality 3D-aware image synthesis. π-GAN leverages neural representations with periodic activation functions and volumetric rendering to represent scenes as view-consistent radiance fields. The proposed approach obtains state-of-the-art results for 3D-aware image synthesis with multiple real and synthetic datasets.
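To make the periodic activations concrete, here is a minimal SIREN-style sine layer in PyTorch. The layer widths, omega_0 = 30, and the simple RGB-plus-density head are illustrative assumptions for a toy radiance field, not π-GAN's exact generator (which additionally conditions the network on a latent code):

import math
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    """A fully connected layer with a periodic (sine) activation, as used
    in SIREN-style implicit representations. omega_0 = 30 follows the
    common SIREN default; widths here are illustrative only."""
    def __init__(self, in_features, out_features, omega_0=30.0, is_first=False):
        super().__init__()
        self.omega_0 = omega_0
        self.linear = nn.Linear(in_features, out_features)
        # SIREN's frequency-aware initialization keeps activations
        # well-distributed through depth; the first layer uses a
        # wider uniform range.
        with torch.no_grad():
            if is_first:
                bound = 1.0 / in_features
            else:
                bound = math.sqrt(6.0 / in_features) / omega_0
            self.linear.weight.uniform_(-bound, bound)

    def forward(self, x):
        return torch.sin(self.omega_0 * self.linear(x))

# Toy radiance-field backbone: 3D coordinates -> (RGB, density).
model = nn.Sequential(
    SineLayer(3, 256, is_first=True),
    SineLayer(256, 256),
    nn.Linear(256, 4),  # 3 color channels + 1 density value
)
rgb_sigma = model(torch.rand(1024, 3))  # 1024 sampled 3D points
print(rgb_sigma.shape)  # torch.Size([1024, 4])

The frequency-aware initialization is what keeps deep stacks of sine layers trainable, which is the expressiveness advantage over standard ReLU representation networks that the abstract alludes to.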
ISBN (Digital): 9798350353006
ISBN (Print): 9798350353013
Existing approaches to unsupervised video instance segmentation typically rely on motion estimates and experience difficulties tracking small or divergent motions. We present VideoCutLER, a simple method for unsupervised multi-instance video segmentation without using motion-based learning signals like optical flow or training on natural videos. Our key insight is that using high-quality pseudo masks and a simple video synthesis method for model training is surprisingly sufficient to enable the resulting video model to effectively segment and track multiple instances across video frames. We show the first competitive unsupervised learning results on the challenging YouTubeVIS-2019 benchmark, achieving 50.7% $AP_{50}^{video}$, surpassing the previous state-of-the-art by a large margin. VideoCutLER can also serve as a strong pretrained model for supervised video instance segmentation tasks, exceeding DINO by 15.9% on YouTubeVIS-2019 in terms of $AP^{video}$.
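As a sketch of what a simple video synthesis method could look like, the NumPy snippet below pastes a pseudo-masked instance onto a background and translates it by a small random step per frame, yielding synthetic frames with per-frame instance masks. All shapes, step sizes, and the single-instance setup are assumptions for illustration, not the paper's exact recipe:

import numpy as np

def synthesize_video(background, instance, mask, num_frames=8, max_step=6, seed=0):
    """Cut-and-paste style video synthesis: an instance crop (with its
    pseudo mask) is pasted onto a background and moved linearly by a
    random per-video step, producing frames plus matching masks."""
    rng = np.random.default_rng(seed)
    H, W, _ = background.shape
    h, w = mask.shape
    y, x = rng.integers(0, H - h), rng.integers(0, W - w)
    dy, dx = rng.integers(-max_step, max_step + 1, size=2)
    frames, masks = [], []
    for _ in range(num_frames):
        frame = background.copy()
        frame_mask = np.zeros((H, W), dtype=np.uint8)
        region = frame[y:y + h, x:x + w]          # view into the frame
        region[mask > 0] = instance[mask > 0]     # paste the instance
        frame_mask[y:y + h, x:x + w] = mask
        frames.append(frame)
        masks.append(frame_mask)
        # Linear motion, clipped so the instance stays inside the frame.
        y = int(np.clip(y + dy, 0, H - h))
        x = int(np.clip(x + dx, 0, W - w))
    return np.stack(frames), np.stack(masks)

bg = np.zeros((128, 128, 3), dtype=np.uint8)
obj = np.full((32, 32, 3), 255, dtype=np.uint8)
m = np.ones((32, 32), dtype=np.uint8)
video, video_masks = synthesize_video(bg, obj, m)
print(video.shape, video_masks.shape)  # (8, 128, 128, 3) (8, 128, 128)

The masks come for free from the pasting process, which is why such synthetic clips can supervise segmentation and tracking without optical flow or natural-video labels.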
ISBN (Digital): 9798350353006
ISBN (Print): 9798350353013
The main function of depth completion is to compensate for an insufficient and unpredictable number of sparse depth measurements from hardware sensors. However, existing research on depth completion assumes that the sparsity, i.e., the number of points or LiDAR lines, is fixed for training and testing. Hence, the completion performance drops severely when the number of sparse depths changes significantly. To address this issue, we propose the sparsity-adaptive depth refinement (SDR) framework, which refines monocular depth estimates using sparse depth points. For SDR, we propose the masked spatial propagation network (MSPN) to perform SDR effectively with a varying number of sparse depths by gradually propagating sparse depth information throughout the entire depth map. Experimental results demonstrate that MSPN achieves state-of-the-art performance in both SDR and conventional depth completion scenarios. Codes are available at https://***/jyjunmcl/MSPN_SDR
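The gradual propagation of sparse depth can be pictured with a generic spatial-propagation sketch: a dense monocular estimate is repeatedly blended with its 3x3 neighborhood under affinity weights while the known sparse measurements are re-imposed each iteration. In practice the affinities would be predicted by a network; this simplified, unmasked variant is an assumption, not MSPN's exact formulation:

import torch
import torch.nn.functional as F

def propagate_sparse_depth(init_depth, sparse_depth, affinity, num_iters=6):
    """Iteratively diffuses sparse measurements into a dense estimate.

    init_depth:   (B, 1, H, W) dense monocular depth estimate
    sparse_depth: (B, 1, H, W) sparse measurements, 0 where missing
    affinity:     (B, 9, H, W) non-negative weights over each 3x3 window
    """
    valid = (sparse_depth > 0).float()
    depth = init_depth * (1 - valid) + sparse_depth * valid
    for _ in range(num_iters):
        # Gather each pixel's 3x3 neighborhood: (B, 9, H, W).
        patches = F.unfold(depth, kernel_size=3, padding=1)
        patches = patches.view(depth.shape[0], 9, *depth.shape[2:])
        weights = affinity / affinity.sum(dim=1, keepdim=True).clamp(min=1e-6)
        depth = (weights * patches).sum(dim=1, keepdim=True)
        # Re-anchor the known sparse measurements at every iteration.
        depth = depth * (1 - valid) + sparse_depth * valid
    return depth

B, H, W = 1, 32, 32
init = torch.rand(B, 1, H, W)
sparse = torch.zeros(B, 1, H, W)
sparse[:, :, ::8, ::8] = 2.0  # a few "LiDAR" points
aff = torch.rand(B, 9, H, W)
print(propagate_sparse_depth(init, sparse, aff).shape)  # torch.Size([1, 1, 32, 32])

Because the propagation only anchors wherever measurements exist, the same refinement loop works for any number of sparse points, which is the sparsity-adaptive behavior the abstract targets.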
ISBN (Digital): 9798350353006
ISBN (Print): 9798350353013
As wearable cameras become more popular, an important question emerges: how to identify camera wearers within the perspective of conventional static cameras. The drastic difference between first-person (egocentric) and third-person (exocentric) camera views makes this a challenging task. We present PersonEnvironmentNet (PEN), a framework designed to integrate information from both the individuals in the two views and geometric cues inferred from the background environment. To facilitate research in this direction, we also present TF2023, a novel dataset comprising synchronized first-person and third-person views, along with masks of camera wearers and labels associating these masks with the respective first-person views. In addition, we propose a novel quantitative metric designed to measure a model's ability to comprehend the relationship between the two views. Our experiments reveal that PEN outperforms existing methods. The code and dataset are available at https://***/ziweizhao1993/PEN.
ISBN (Digital): 9798350365474
ISBN (Print): 9798350365481
We introduce Dynamic Distinction Learning (DDL), a novel video anomaly detection methodology that combines pseudo-anomalies, dynamic anomaly weighting, and a distinction loss function to improve detection accuracy. By training on pseudo-anomalies, our approach adapts to the variability of normal and anomalous behaviors without fixed anomaly thresholds. Our model showcases superior performance on the Ped2, Avenue and ShanghaiTech datasets, where individual models are tailored for each scene. These achievements highlight DDL's effectiveness in advancing anomaly detection, offering a scalable and adaptable solution for video surveillance challenges. Our work can be found at: https://***/demetrislappas/***
ISBN (Digital): 9798350353006
ISBN (Print): 9798350353013
Generating dances that are both lifelike and well-aligned with music continues to be a challenging task in the cross-modal domain. This paper introduces PopDanceSet, the first dataset tailored to the preferences of young audiences, enabling the generation of aesthetically oriented dances. It surpasses the AIST++ dataset in music genre diversity and in the intricacy and depth of dance movements. Moreover, the proposed POPDG model within the iDDPM framework enhances dance diversity and, through the Space Augmentation Algorithm, strengthens spatial physical connections between human body joints, ensuring that increased diversity does not compromise generation quality. A streamlined Alignment Module is also designed to improve the temporal alignment between dance and music. Extensive experiments show that POPDG achieves SOTA results on two datasets. Furthermore, the paper also expands on current evaluation metrics. The dataset and code are available at https://***/Luke-Luol/POPDG.
ISBN (Digital): 9798350353006
ISBN (Print): 9798350353013
Recent advancements in the study of Neural Radiance Fields (NeRF) for dynamic scenes often involve explicit modeling of scene dynamics. However, this approach faces challenges in modeling scene dynamics in urban environments, where moving objects of various categories and scales are present. In such settings, it becomes crucial to effectively eliminate moving objects to accurately reconstruct static backgrounds. Our research introduces an innovative method, termed Entity-NeRF, which combines the strengths of knowledge-based and statistical strategies. This approach utilizes entity-wise statistics, leveraging entity segmentation and stationary entity classification through thing/stuff segmentation. To assess our methodology, we created an urban scene dataset annotated with moving-object masks. Our comprehensive experiments demonstrate that Entity-NeRF notably outperforms existing techniques in removing moving objects and reconstructing static urban backgrounds, both quantitatively and qualitatively.
Our project page is available at https://***/entitynerf/
ISBN (Digital): 9798350353006
ISBN (Print): 9798350353013
Vision-language pre-training (VLP) aims to learn joint representations of the vision and language modalities. The contrastive paradigm is currently dominant in this field. However, we observe a notable misalignment phenomenon: the affinity between samples shows an obvious disparity across different modalities, which we call the “Affinity Inconsistency Problem”. Our intuition is that, for a well-aligned model, two images that look similar to each other should have the same level of similarity as the corresponding texts that describe them. In this paper, we first investigate the reason for this inconsistency problem. We discover that the lack of consideration for sample-wise affinity consistency across modalities in existing training objectives is the central cause. To address this problem, we propose a novel loss function, named Sample-wise affinity Consistency (SaCo) loss, which is designed to enhance such consistency by minimizing the distance between the image embedding similarity and the text embedding similarity for any two samples. Our SaCo loss can be easily incorporated into existing vision-language models as an additional loss due to its complementarity with most training objectives. In addition, considering that pre-training from scratch is computationally expensive, we also provide a more efficient way to continue pre-training from a converged model by integrating our loss. Experimentally, the model trained with our SaCo loss significantly outperforms the baseline on a variety of vision and language tasks.
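A minimal sketch of such a sample-wise affinity consistency term, assuming cosine similarities and a pairwise L1 distance (the paper's exact distance and weighting may differ):

import torch
import torch.nn.functional as F

def saco_loss(image_emb, text_emb):
    """For every pair of samples (i, j) in the batch, penalize the gap
    between the image-image similarity and the corresponding text-text
    similarity.

    image_emb, text_emb: (B, D) embeddings for a batch of B image-text pairs.
    """
    v = F.normalize(image_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    sim_v = v @ v.T  # (B, B) image-image affinities
    sim_t = t @ t.T  # (B, B) text-text affinities
    # Compare affinities over all distinct pairs (exclude the diagonal).
    off_diag = ~torch.eye(len(v), dtype=torch.bool, device=v.device)
    return (sim_v - sim_t).abs()[off_diag].mean()

img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(saco_loss(img, txt).item())

In training this would be added to the base contrastive objective as a weighted auxiliary term, which matches the "additional loss" usage the abstract describes.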
ISBN (Print): 9781665445092
Our objective in this work is fine-grained classification of actions in untrimmed videos, where the actions may be temporally extended or may span only a few frames of the video. We cast this into a query-response mechanism, where each query addresses a particular question and has its own response label set. We make the following four contributions: (i) We propose a new model, the Temporal Query Network (TQN), which enables the query-response functionality and a structural understanding of fine-grained actions. It attends to relevant segments for each query with a temporal attention mechanism, and can be trained using only the labels for each query. (ii) We propose a new training method, a stochastic feature bank update, to train a network on videos of various lengths with the dense sampling required to respond to fine-grained queries. (iii) We compare the TQN to other architectures and text supervision methods, and analyze their pros and cons. Finally, (iv) we evaluate the method extensively on the FineGym and Diving48 benchmarks for fine-grained action classification and surpass the state-of-the-art using only RGB features. Project page: https://***/~vgg/research/tqn/.
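A minimal sketch of the query-response idea, assuming learned query vectors, a single cross-attention layer over per-frame features, and a separate classifier per query's response label set; the dimensions and single attention layer are illustrative, not TQN's full architecture:

import torch
import torch.nn as nn

class TemporalQueryHead(nn.Module):
    """Each learned query attends over frame features with temporal
    attention, and its pooled response is decoded by a query-specific
    classifier into that query's own label set."""
    def __init__(self, feat_dim, num_queries, labels_per_query):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, feat_dim))
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.heads = nn.ModuleList(
            nn.Linear(feat_dim, n) for n in labels_per_query
        )

    def forward(self, frame_feats):
        # frame_feats: (B, T, D) features for T (possibly many) frames.
        q = self.queries.unsqueeze(0).expand(frame_feats.size(0), -1, -1)
        responses, _ = self.attn(q, frame_feats, frame_feats)  # (B, Q, D)
        # One response label set (and hence one loss) per query.
        return [head(responses[:, i]) for i, head in enumerate(self.heads)]

model = TemporalQueryHead(feat_dim=256, num_queries=3, labels_per_query=[5, 2, 4])
logits = model(torch.randn(2, 100, 256))  # 2 videos, 100 frames each
print([tuple(l.shape) for l in logits])   # [(2, 5), (2, 2), (2, 4)]

Because each query pools only the segments relevant to its question, the model can be supervised with per-query labels alone, as the abstract notes.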
ISBN (Digital): 9798350353006
ISBN (Print): 9798350353013
The boundless space of neural networks that could be used to solve a problem, each with different performance, leads to a situation where a Deep Learning expert is required to identify the best neural network. This goes against the hope of removing the need for experts. Neural Architecture Search (NAS) offers a solution to this by automatically identifying the best architecture. However, to date, NAS work has focused on a small set of datasets, which we argue are not representative of real-world problems. We introduce eight new datasets created for a series of NAS Challenges: AddNIST, Language, MultNIST, CIFAR-Tile, Gutenberg, Isabella, GeoClassing, and Chesseract. These datasets and challenges were developed to direct attention to issues in NAS development and to encourage authors to consider how their models will perform on datasets unknown to them at development time. We present experimentation using standard Deep Learning methods as well as the best results from challenge participants.