Temporal action segmentation in untrimmed videos has gained increased attention recently. However, annotating action classes and frame-wise boundaries is extremely time consuming and cost intensive, especially on larg...
Recently, large-scale pre-trained vision and language (VL) models have set a new state-of-the-art (SOTA) in zero-shot visual classification, enabling open-vocabulary recognition of a potentially unlimited set of categories defined as simple language prompts. However, despite these great advances, the performance of these zero-shot classifiers still falls short of the results of dedicated (closed category set) classifiers trained with supervised fine-tuning. In this paper, we show, for the first time, how to reduce this gap without any labels and without any paired VL data, using an unlabeled image collection and a set of texts auto-generated using a Large Language Model (LLM) describing the categories of interest and effectively substituting labeled visual instances of those categories. Using our label-free approach, we are able to attain significant performance improvements over the zero-shot performance of the base VL model and other contemporary methods and baselines on a wide variety of datasets, demonstrating an absolute improvement of up to 11.7% (3.8% on average) in the label-free setting. Moreover, despite our approach being label-free, we observe 1.3% average gains over leading few-shot prompting baselines that do use 5-shot supervision.
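The general idea of adapting a zero-shot classifier with LLM-generated class descriptions instead of labels can be illustrated with a minimal sketch: build text prototypes from several descriptions per class, pseudo-label unlabeled images with them, and self-train on the confident pseudo-labels. This is not the paper's exact method; the stub encoders stand in for a real VL model such as CLIP, and the description strings and confidence threshold are illustrative assumptions.

```python
# Minimal sketch of label-free adaptation with LLM-generated class descriptions.
# The encoders are random stand-ins for a real VL model (e.g. CLIP); the
# descriptions and threshold are illustrative assumptions, not the paper's setup.
import torch
import torch.nn.functional as F
from torch import nn

torch.manual_seed(0)
EMB = 64  # embedding dimension of the (stub) VL model

# Stub encoders: in practice these are the frozen text tower and a lightly
# tuned image tower / adapter of a pre-trained VL model.
text_encoder = nn.Embedding(1000, EMB)                 # fake "tokenized text" -> embedding
image_encoder = nn.Sequential(nn.Linear(3 * 32 * 32, EMB))

# One list of LLM-generated descriptions per class (hashed to fake token ids).
class_descriptions = {
    "cat": ["a small furry pet with whiskers", "an animal that purrs"],
    "dog": ["a loyal pet that barks", "a four-legged companion animal"],
}
classes = list(class_descriptions)

def class_prototypes():
    """Average the normalized text embeddings of each class's descriptions."""
    protos = []
    for c in classes:
        ids = torch.tensor([hash(d) % 1000 for d in class_descriptions[c]])
        emb = F.normalize(text_encoder(ids), dim=-1).mean(0)
        protos.append(F.normalize(emb, dim=-1))
    return torch.stack(protos)                         # (num_classes, EMB)

# Unlabeled image collection (random tensors stand in for real images).
unlabeled_images = torch.randn(256, 3 * 32 * 32)

optimizer = torch.optim.Adam(image_encoder.parameters(), lr=1e-4)
for step in range(10):
    idx = torch.randint(0, len(unlabeled_images), (32,))
    feats = F.normalize(image_encoder(unlabeled_images[idx]), dim=-1)
    logits = 100.0 * feats @ class_prototypes().T      # zero-shot similarity scores
    conf, pseudo = logits.softmax(dim=-1).max(dim=-1)
    keep = conf > 0.6                                  # assumed confidence threshold
    if keep.any():
        loss = F.cross_entropy(logits[keep], pseudo[keep])  # self-training on pseudo-labels
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```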
The aorta is the largest vessel of the human body and its pathological degenerations, such as dissections and aneurysms, can be life threatening. An automatic and fast segmentation of the aorta can therefore be a help...
Since its inception, the CUDA programming model has been continuously evolving. Because the CUDA toolkit aims to consistently expose cutting-edge capabilities for general-purpose compute jobs to its users, the added f...
Test-Time-Training (TTT) is an approach to cope with out-of-distribution (OOD) data by adapting a trained model to distribution shifts occurring at test-time. We propose to perform this adaptation via Activation Matching (ActMAD): We analyze activations of the model and align activation statistics of the OOD test data to those of the training data. In contrast to existing methods, which model the distribution of entire channels in the ultimate layer of the feature extractor, we model the distribution of each feature in multiple layers across the network. This results in more fine-grained supervision and makes ActMAD attain state-of-the-art performance on CIFAR-100C and ImageNet-C. ActMAD is also architecture- and task-agnostic, which lets us go beyond image classification and score a 15.4% improvement over previous approaches when evaluating a KITTI-trained object detector on KITTI-Fog. Our experiments highlight that ActMAD can be applied to online adaptation in realistic scenarios, requiring little data to attain its full performance.
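A minimal sketch of the activation-matching idea described above: per-feature means and variances of activations in several layers are pushed, via an L1 loss, towards statistics pre-computed on the training data. The tiny CNN, the random "training" statistics, and the optimizer settings are placeholders rather than the paper's actual models or hyper-parameters.

```python
# Sketch of activation matching for test-time adaptation: match element-wise
# activation means/variances of multiple layers to stored training statistics.
import torch
import torch.nn.functional as F
from torch import nn

torch.manual_seed(0)
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10),
)
hooked = [model[0], model[2]]                     # layers whose activations we match

# Capture activations of the hooked layers on each forward pass.
acts = {}
for i, layer in enumerate(hooked):
    layer.register_forward_hook(lambda m, inp, out, i=i: acts.__setitem__(i, out))

# Per-feature training statistics; in practice these come from one pass over
# the clean training data, here they are random placeholders.
train_mean = {0: torch.randn(16, 32, 32), 1: torch.randn(32, 32, 32)}
train_var = {0: torch.rand(16, 32, 32), 1: torch.rand(32, 32, 32)}

optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
for step in range(5):
    test_batch = torch.randn(16, 3, 32, 32)       # stands in for OOD test images
    model(test_batch)                             # populates `acts` via the hooks
    loss = 0.0
    for i in acts:
        mean = acts[i].mean(dim=0)                # element-wise mean over the batch
        var = acts[i].var(dim=0)                  # element-wise variance over the batch
        loss = loss + F.l1_loss(mean, train_mean[i]) + F.l1_loss(var, train_var[i])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```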
Mobile edge computing (MEC) is a newly emerging concept that provides significant local computing power and reduces end-to-end latency. In MEC environments, caching frequently accessed services on edge servers effecti...
ISBN (electronic): 9781665463829
ISBN (print): 9781665463836
Pedestrians and cyclists suffer the most serious injuries in traffic accidents. Existing Pedestrian Protection Systems and Road Safety Systems rely on an ideal model of pedestrian behavior and do not consider that people tend to take shortcuts, appear in unexpected places, or can be distracted on the road, for example by using a smartphone or wearing headphones. Collecting and analyzing realistic road user behavior is a crucial component of improving pedestrian and cyclist safety. However, such real-world data is still missing. To address this, we propose a visual surveillance system with two perpendicular, partially overlapping fields of view, combined with a fully automated deep learning-based pipeline to process and collect video observations, detect and extract road user trajectories in real-world coordinates, and estimate human attributes such as age, gender, and smartphone usage. We demonstrate our prototype by deploying it at two locations in a European city.
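One stage of such a pipeline, extracting trajectories in real-world coordinates, can be sketched as a homography-based mapping from a detection's foot point in image pixels to metric ground-plane coordinates. The pixel/world correspondences and detection boxes below are made-up example values, not the paper's calibration or detector output; a real deployment would calibrate them per camera.

```python
# Sketch: project the foot point of pedestrian detections from image pixels to
# real-world ground-plane coordinates using a calibrated homography.
import cv2
import numpy as np

# Four (or more) ground control points: image pixels -> metric world coordinates.
image_pts = np.float32([[100, 700], [1800, 720], [950, 400], [300, 380]])
world_pts = np.float32([[0.0, 0.0], [12.5, 0.0], [6.0, 20.0], [1.5, 21.0]])  # metres
H, _ = cv2.findHomography(image_pts, world_pts)

def footpoint_to_world(box, homography):
    """Project the bottom-centre of a detection box (x1, y1, x2, y2) to world coords."""
    x1, y1, x2, y2 = box
    foot = np.float32([[[(x1 + x2) / 2.0, y2]]])          # shape (1, 1, 2) for OpenCV
    return cv2.perspectiveTransform(foot, homography)[0, 0]

# Detections from an (assumed) upstream detector, one per frame for one track.
track_boxes = [(400, 300, 460, 520), (420, 305, 482, 530), (445, 310, 508, 540)]
trajectory = [footpoint_to_world(b, H) for b in track_boxes]
for t, (x, y) in enumerate(trajectory):
    print(f"frame {t}: x = {x:.2f} m, y = {y:.2f} m")
```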
In the field of autonomous driving, self-training is widely applied to mitigate distribution shifts in LiDAR-based 3D object detectors. This eliminates the need for expensive, high-quality labels whenever the environm...
Although action recognition systems can achieve top performance when evaluated on in-distribution test points, they are vulnerable to unanticipated distribution shifts in test data. However, test-time adaptation of video action recognition models against common distribution shifts has so far not been demonstrated. We propose to address this problem with an approach tailored to spatio-temporal models that is capable of adaptation on a single video sample at each step. It consists of a feature distribution alignment technique that aligns online estimates of test set statistics towards the training statistics. We further enforce prediction consistency over temporally augmented views of the same test video sample. Evaluations on three benchmark action recognition datasets show that our proposed technique is architecture-agnostic and able to significantly boost the performance of both the state-of-the-art convolutional architecture TANet and the Video Swin Transformer. Our proposed method demonstrates a substantial performance gain over existing test-time adaptation approaches in both the evaluation of a single distribution shift and the challenging case of random distribution shifts. Code will be available at https://***/wlin-at/ViTTA.
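The two ingredients described above can be sketched for a video model as follows: (1) aligning online (exponential-moving-average) estimates of test-time feature statistics to stored training statistics, and (2) a consistency loss between predictions for two temporally augmented views of the same clip. The tiny 3D CNN, the random "training" statistics, the temporal-reversal augmentation, and all hyper-parameters are placeholder assumptions, not the paper's exact recipe.

```python
# Sketch of test-time adaptation for a video model: statistics alignment plus
# prediction consistency over two temporal views of the same test clip.
import torch
import torch.nn.functional as F
from torch import nn

torch.manual_seed(0)
model = nn.Sequential(
    nn.Conv3d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(8, 5),
)
feat_layer = model[0]
acts = {}
feat_layer.register_forward_hook(lambda m, i, o: acts.__setitem__("f", o))

train_mean, train_var = torch.zeros(8), torch.ones(8)   # placeholder training stats
ema_mean, ema_var, momentum = None, None, 0.1           # online test-time estimates

optimizer = torch.optim.SGD(model.parameters(), lr=1e-5)
for step in range(5):
    clip = torch.randn(1, 3, 16, 32, 32)                # one test video (B, C, T, H, W)
    views = torch.cat([clip, clip.flip(dims=[2])])      # second view: reversed in time
    logits = model(views)
    feats = acts["f"].transpose(0, 1).reshape(8, -1)    # channel-wise statistics
    mean, var = feats.mean(dim=1), feats.var(dim=1)
    if ema_mean is None:
        ema_mean, ema_var = mean, var
    else:                                               # online EMA of test statistics
        ema_mean = (1 - momentum) * ema_mean.detach() + momentum * mean
        ema_var = (1 - momentum) * ema_var.detach() + momentum * var
    align = F.l1_loss(ema_mean, train_mean) + F.l1_loss(ema_var, train_var)
    p = logits.softmax(dim=-1)
    consistency = F.mse_loss(p[0], p[1])                # the two views should agree
    loss = align + consistency
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```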