ISBN: 9781665445092 (print)
We present a self-supervised learning approach to learn audio-visual representations from video and audio. Our method uses contrastive learning for cross-modal discrimination of video from audio and vice versa. We show that optimizing for cross-modal discrimination, rather than within-modal discrimination, is important to learn good representations from video and audio. With this simple but powerful insight, our method achieves highly competitive performance when finetuned on action recognition tasks. Furthermore, while recent work in contrastive learning defines positive and negative samples as individual instances, we generalize this definition by exploring cross-modal agreement. We group together multiple instances as positives by measuring their similarity in both the video and audio feature spaces. Cross-modal agreement creates better positive and negative sets, which allows us to calibrate visual similarities by seeking within-modal discrimination of positive instances, and achieve significant gains on downstream tasks.
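Cross-modal discrimination, as described in the abstract above, can be illustrated with a generic InfoNCE-style loss. This is a minimal sketch and not the authors' exact objective: the function names, the temperature value, and the use of in-batch negatives are all assumptions made for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    # Fall back to 1.0 so an all-zero vector does not divide by zero.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]

def cross_modal_nce(video_embs, audio_embs, temperature=0.07):
    """Average InfoNCE loss over video->audio and audio->video directions.

    video_embs[i] and audio_embs[i] are assumed to come from the same clip
    (the positive pair); every other clip in the batch acts as a negative.
    """
    v = [l2_normalize(x) for x in video_embs]
    a = [l2_normalize(x) for x in audio_embs]
    n = len(v)

    def one_direction(queries, keys):
        total = 0.0
        for i in range(n):
            logits = [dot(queries[i], k) / temperature for k in keys]
            m = max(logits)  # subtract the max for numerical stability
            log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_denominator - logits[i]
        return total / n

    return 0.5 * (one_direction(v, a) + one_direction(a, v))
```

Note that each query is only ever compared against keys from the other modality, which is the cross-modal (rather than within-modal) setting the abstract argues for.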
This paper proposes a handcrafted feature-based descriptor, namely the Local Neighborhood Average Pattern (LNAP), for static hand gesture recognition. The fact that local descriptors are important in numerous computer ...
ISBN: 9781665448994 (print)
We introduce a novel method for collecting table tennis video data and perform stroke detection and classification. Using the proposed setup, we collected a diverse dataset of 11 basic strokes performed by 14 professional table tennis players, totaling 22,111 videos. A temporal convolutional neural network built on 2D pose estimation performs multiclass classification of these 11 table tennis strokes with a validation accuracy of 99.37%. Moreover, the neural network generalizes well to the data of a player excluded from the training and validation datasets, classifying the unseen strokes with an overall best accuracy of 98.72%. Various model architectures using machine learning and deep learning based approaches have been trained for stroke recognition, and their performances have been compared and benchmarked. Applications of the model, such as performance monitoring and stroke comparison between players, are discussed. We thereby contribute to the development of a computer-vision-based sports analytics system for table tennis that focuses on a previously unexploited aspect of the sport, namely a player's strokes, which is extremely insightful for performance improvement.
ISBN: 9781665445092 (print)
We present a method to perform novel view and time synthesis of dynamic scenes, requiring only a monocular video with known camera poses as input. To do this, we introduce Neural Scene Flow Fields, a new representation that models the dynamic scene as a time-variant continuous function of appearance, geometry, and 3D scene motion. Our representation is optimized through a neural network to fit the observed input views. We show that our representation can be used for a variety of in-the-wild scenes, including thin structures, view-dependent effects, and complex degrees of motion. We conduct a number of experiments that demonstrate our approach significantly outperforms recent monocular view synthesis methods, and show qualitative results of space-time view synthesis on a variety of real-world videos.
ISBN: 9781665445092 (print)
Human trajectory forecasting in crowds, at its core, is a sequence prediction problem with specific challenges of capturing inter-sequence dependencies (social interactions) and consequently predicting socially-compliant multi-modal distributions. In recent years, neural network-based methods have been shown to outperform hand-crafted methods on distance-based metrics. However, these data-driven methods still suffer from one crucial limitation: lack of interpretability. To overcome this limitation, we leverage the power of discrete choice models to learn interpretable rule-based intents, and subsequently utilise the expressibility of neural networks to model scene-specific residual. Extensive experimentation on the interaction-centric benchmark TrajNet++ demonstrates the effectiveness of our proposed architecture to explain its predictions without compromising the accuracy.
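The combination of a discrete choice model with a learned residual can be sketched with a standard multinomial logit, where each candidate action gets an interpretable linear utility plus an additive correction term. This is only an illustrative reading of the abstract: the feature meanings, the weights, and the way the residual enters the utility are hypothetical and are not taken from the paper.

```python
import math

def choice_probabilities(features, weights, residuals=None):
    """Multinomial logit over candidate actions.

    features[i]  : hand-designed attributes of candidate i (hypothetical here)
    weights      : one interpretable coefficient per attribute
    residuals[i] : optional extra utility, standing in for a network output
    """
    if residuals is None:
        residuals = [0.0] * len(features)
    utilities = [
        sum(w * x for w, x in zip(weights, f)) + r
        for f, r in zip(features, residuals)
    ]
    m = max(utilities)  # subtract the max for numerical stability
    exps = [math.exp(u - m) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]

# Three candidate moves described by two made-up attributes:
# [keeps current direction, avoids a neighbour].
candidates = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
rule_based = choice_probabilities(candidates, weights=[1.0, 2.0])
with_residual = choice_probabilities(candidates, [1.0, 2.0], residuals=[0.0, 0.0, 5.0])
```

Because the rule-based part is linear, each weight can be read directly as the importance of one attribute, while the residual only shifts the utilities that the rules leave unexplained.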
ISBN: 9781665445092 (print)
Feature alignment is an approach to improving robustness to distribution shift that matches the distribution of feature activations between the training distribution and test distribution. A particularly simple but effective approach to feature alignment involves aligning the batch normalization statistics between the two distributions in a trained neural network. This technique has received renewed interest lately because of its impressive performance on robustness benchmarks. However, when and why this method works is not well understood. We investigate the approach in more detail and identify several limitations. We show that it only significantly helps with a narrow set of distribution shifts and we identify several settings in which it even degrades performance. We also explain why these limitations arise by pinpointing why this approach can be so effective in the first place. Our findings call into question the utility of this approach and Unsupervised Domain Adaptation more broadly for improving robustness in practice.
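One common way to align batch normalization statistics is to re-estimate the per-feature mean and variance on a batch of test data and normalize with those instead of the statistics stored during training. The sketch below shows that idea on plain lists; it assumes this re-estimation variant, and the helper names are illustrative rather than taken from the paper.

```python
import math

def batch_stats(batch):
    """Per-feature mean and (biased) variance over a batch of feature vectors."""
    n = len(batch)
    dims = len(batch[0])
    means = [sum(row[d] for row in batch) / n for d in range(dims)]
    variances = [
        sum((row[d] - means[d]) ** 2 for row in batch) / n for d in range(dims)
    ]
    return means, variances

def normalize(batch, means, variances, eps=1e-5):
    """Apply batch-norm style normalization with the given statistics."""
    return [
        [(x - m) / math.sqrt(v + eps) for x, m, v in zip(row, means, variances)]
        for row in batch
    ]

def adapt_batch_norm(test_batch, eps=1e-5):
    """Normalize a test batch with statistics estimated from that batch itself."""
    means, variances = batch_stats(test_batch)
    return normalize(test_batch, means, variances, eps)
```

With stale training statistics, a shifted test batch stays off-center after normalization, whereas `adapt_batch_norm` recenters it. This removes shifts in the first two moments of each feature, but it cannot correct shifts that leave those moments unchanged.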
ISBN: 9781665448994 (print)
Traditional computer vision models are trained to predict a fixed set of predefined categories. Recently, natural language has been shown to be a broader and richer source of supervision that provides finer descriptions of visual concepts than supervised "gold" labels. Previous works, such as CLIP, use a simple pretraining task of predicting the pairings between images and text captions. CLIP, however, is data hungry and requires more than 400M image-text pairs for training. We propose a data-efficient contrastive distillation method that uses soft labels to learn from noisy image-text pairs. Our model transfers knowledge from pre-trained image and sentence encoders and achieves strong performance with only 3M image-text pairs, 133x fewer than CLIP. Our method exceeds the previous SoTA of general zero-shot learning on ImageNet 21k+1k by a relative 73% with a ResNet50 image encoder and DeCLUTR text encoder. We also beat CLIP by a relative 10.5% on zero-shot evaluation on Google Open Images (19,958 classes).
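Learning from soft labels can be sketched as a cross-entropy between a teacher's distribution over captions and the student's distribution, in place of a one-hot "matching caption" target. The code below is a hedged illustration only: the function names, the temperature, and the use of raw similarity matrices as inputs are assumptions, not the paper's actual formulation.

```python
import math

def softmax(scores, temperature=1.0):
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_label_loss(student_sims, teacher_sims, temperature=0.1):
    """Cross-entropy between teacher and student image-to-text distributions.

    Row i of each matrix holds the similarities between image i and every
    caption in the batch. The teacher row becomes a soft target, so a caption
    that partly describes image i still receives some probability mass.
    """
    loss = 0.0
    for s_row, t_row in zip(student_sims, teacher_sims):
        targets = softmax(t_row, temperature)
        preds = softmax(s_row, temperature)
        loss -= sum(t * math.log(p) for t, p in zip(targets, preds))
    return loss / len(student_sims)
```

The loss is smallest when the student's similarity pattern reproduces the teacher's, which is how noisy pairs can be down-weighted without being discarded.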
ISBN: 9798350385113; 9798350385106 (print)
In the field of computer vision, shadow detection has been a topic of considerable interest. Shadows encapsulate a wealth of information about the underlying light conditions and scene geometry, making them invaluable for a wide range of visual perception tasks, from understanding the physical properties of light to interpreting the structure and layout of the surrounding environment. Moreover, shadows affect image quality, thereby influencing the results of computer vision algorithms such as object detection, recognition, and tracking. Therefore, identifying and eliminating shadows contributes to improving algorithm performance. However, shadow detection research faces the challenge of scarce high-quality annotated datasets. To address this problem, we introduce a novel shadow detection dataset called Shadow Detection Dataset (SDD), consisting of 2638 images covering various scenes including urban streets, natural landscapes, and indoor spaces. Additionally, we evaluate the performance of eight state-of-the-art object detection methods on SDD to compare and reveal their effectiveness in shadow detection tasks. Through comparative experimental results, we find significant differences in the performance of different methods across various scenes, underscoring the significance of the proposed dataset in assessing the performance of shadow detection techniques. In the future, we look forward to further expanding this dataset and exploring more effective shadow detection methods to meet the growing demands of applications and advance the field. The dataset is available to the public at https://***/hhaozhang/SDD.
ISBN: 9781665448994 (print)
Continual learning (CL) has become one of the most active research areas within the artificial intelligence community in recent years. Given the significant amount of attention paid to continual learning, the need for a library that facilitates both research and development in this field is more visible than ever. However, code for CL algorithms is currently scattered over isolated repositories written with different frameworks, making it difficult for researchers and practitioners to work with various CL algorithms and benchmarks using the same interface. In this paper, we introduce CL-Gym, a full-featured continual learning library that overcomes this challenge and accelerates research and development. In addition to the necessary infrastructure for running end-to-end continual learning experiments, CL-Gym includes benchmarks for various CL scenarios and several state-of-the-art CL algorithms. We present the architecture, design philosophies, and technical details behind CL-Gym.
ISBN: 9781665445092 (print)
The Facial Action Coding System is a taxonomy for fine-grained facial expression analysis. This paper proposes a method for detecting Facial Action Units (FAU), which define particular face muscle activity, from an input image. FAU detection is formulated as a multi-task learning problem, where image features and attention maps are input to a branch for each action unit to extract discriminative feature embeddings, using a new loss function, the center contrastive (CC) loss. We employ a new FAU correlation network, based on a transformer encoder architecture, to capture the relationships between different action units for the wide range of expressions in the training data. The resulting features are shown to yield high classification performance. We validate our design choices, including the use of CC-loss and Tversky loss functions, in ablative experiments. We show that the proposed method outperforms state-of-the-art techniques on two public datasets, BP4D and DISFA, with an absolute F1-score improvement of over 2% on each.
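The Tversky loss mentioned in the abstract has a standard closed form, 1 - TP / (TP + alpha * FP + beta * FN), computed on soft predictions. The sketch below uses that general definition; the default `alpha` and `beta` values and the smoothing constant `eps` are assumptions, since the abstract does not state the settings used.

```python
def tversky_loss(probs, labels, alpha=0.5, beta=0.5, eps=1e-7):
    """Tversky loss for one action unit over a batch.

    probs  : predicted probabilities in [0, 1]
    labels : ground-truth labels, 0 or 1
    alpha  : weight on false positives
    beta   : weight on false negatives
    """
    tp = sum(p * y for p, y in zip(probs, labels))
    fp = sum(p * (1 - y) for p, y in zip(probs, labels))
    fn = sum((1 - p) * y for p, y in zip(probs, labels))
    return 1.0 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)
```

With `alpha = beta = 0.5` this reduces to the Dice loss. Raising `beta` above `alpha` penalizes missed activations more heavily, which is a common choice when positive labels are rare.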