ISBN (digital): 9798350365474
ISBN (print): 9798350365481
The advancement of autonomous drones, essential for sectors such as remote sensing and emergency services, is hindered by the absence of training datasets that fully capture the environmental challenges present in real-world scenarios, particularly operations in non-optimal weather conditions and the detection of thin structures like wires. We present the Drone Depth and Obstacle Segmentation (DDOS) dataset to fill this critical gap with a collection of synthetic aerial images, created to provide comprehensive training samples for semantic segmentation and depth estimation. Specifically designed to enhance the identification of thin structures, DDOS allows drones to navigate a wide range of weather conditions, significantly elevating drone training and operational safety. Additionally, this work introduces innovative drone-specific metrics aimed at refining the evaluation of algorithms in depth estimation, with a focus on thin structure detection. These contributions not only pave the way for substantial improvements in autonomous drone technology but also set a new benchmark for future research, opening avenues for further advancements in drone navigation and safety.
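The abstract does not spell out the drone-specific metrics; one plausible instance is a depth-error measure restricted to thin-structure pixels. The sketch below is our own illustration, not DDOS's published metric: it computes depth RMSE only over a wire mask.

```python
import numpy as np

def thin_structure_rmse(pred_depth, gt_depth, thin_mask):
    """Hypothetical drone-oriented metric: RMSE of predicted depth
    restricted to pixels labelled as thin structures (e.g. wires).

    pred_depth, gt_depth : (H, W) float arrays in metres
    thin_mask            : (H, W) bool array, True on thin-structure pixels
    """
    err = (pred_depth - gt_depth)[thin_mask]
    if err.size == 0:
        return 0.0  # no thin structures in this frame
    return float(np.sqrt(np.mean(err ** 2)))

# Toy example: a 4x4 depth map with a one-pixel-wide "wire" on row 1.
gt = np.full((4, 4), 10.0)
gt[1, :] = 2.0                       # wire 2 m away
pred = gt.copy()
pred[1, :] = 4.0                     # predictor misses the wire by 2 m
mask = np.zeros((4, 4), dtype=bool)
mask[1, :] = True

print(thin_structure_rmse(pred, gt, mask))  # 2.0
```

A conventional full-image RMSE would barely register this error, since the wire covers only 4 of 16 pixels; restricting the metric to the mask is what makes thin-structure failures visible.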
ISBN (print): 9781665448994
Recently, semi-supervised domain adaptation (SSDA) approaches have shown impressive performance on the domain adaptation task. They effectively utilize a few labeled target samples along with the unlabeled data to account for the distribution shift across the source and target domains. In this work, we make three-fold contributions, concentrating on the role of target samples and semantics for the SSDA task. First, we observe that choosing even a few labeled samples, balanced across the classes of the target domain, requires a significant amount of manual effort. To address this, we propose an active learning-based framework that models both sample diversity and classifier uncertainty. Using k-means-initialized cluster centers to pick a small pool of diverse unlabeled target samples, we compute a novel classifier adaptation uncertainty term to select the most effective samples from this pool, which are then queried to obtain their true labels from an oracle. Second, we propose to weigh the hard target samples more, without explicitly using their predicted, possibly incorrect, labels, which guides the adaptation process. Third, we note that the semantics of the classes remain unchanged irrespective of the domain shift, so they can be effectively utilized for this task. We show that initializing the class representations, or prototypes, with the class semantics significantly helps bridge the domain gap. These contributions, together with an adversarially learnt entropy objective, result in a novel framework, termed STar (Select TARgets), which sets a new state of the art for the SSDA task.
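The diversity-then-uncertainty selection step can be sketched as follows. This is a minimal stand-in, not the paper's code: k-means representatives supply the diverse pool, and plain predictive entropy substitutes for the paper's classifier adaptation uncertainty term; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans_diverse_pool(feats, k, iters=10):
    """Pick a diverse pool: run a small k-means and return the index of
    the sample nearest to each cluster centre (sketch, not the paper's code)."""
    centres = feats[rng.choice(len(feats), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(feats[:, None] - centres[None], axis=2)  # (n, k)
        assign = d.argmin(1)
        for c in range(k):
            pts = feats[assign == c]
            if len(pts):
                centres[c] = pts.mean(0)
    d = np.linalg.norm(feats[:, None] - centres[None], axis=2)
    return np.unique(d.argmin(0))        # one representative per centre

def entropy(probs):
    """Stand-in uncertainty: predictive entropy of the softmax outputs."""
    return -(probs * np.log(probs + 1e-12)).sum(1)

def select_for_labeling(feats, probs, pool_size, budget):
    pool = kmeans_diverse_pool(feats, pool_size)
    unc = entropy(probs[pool])
    return pool[np.argsort(-unc)[:budget]]   # most uncertain in the pool

feats = rng.normal(size=(40, 5))             # unlabeled target features
probs = rng.dirichlet(np.ones(3), size=40)   # classifier softmax outputs
picked = select_for_labeling(feats, probs, pool_size=8, budget=3)
print(picked)                                # indices sent to the oracle
```

The two-stage design matters: uncertainty alone tends to pick near-duplicate ambiguous samples, while restricting it to cluster representatives keeps the labeled set spread across the target distribution.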
ISBN (print): 9781665448994
Image quality assessment (IQA) aims to assess the perceptual quality of images, and the outputs of IQA algorithms are expected to be consistent with human subjective perception. In image restoration and enhancement tasks, images generated by generative adversarial networks (GANs) can achieve better visual quality than traditional CNN-generated images, although they exhibit spatial shift and texture noise. Unfortunately, existing IQA methods perform unsatisfactorily on GAN-based distortion, partly because of their low tolerance to spatial misalignment. To this end, we propose a reference-oriented deformable convolution, which improves the performance of an IQA network on GAN-based distortion by adaptively accounting for this misalignment. We further propose a patch-level attention module to enhance the interaction among different patch regions, which were processed independently in previous patch-based methods. We also modify the classic residual block to construct a patch-region-based baseline called WResNet. Equipping this baseline with the two proposed modules yields the Region-Adaptive Deformable Network (RADN). Experimental results on the NTIRE 2021 Perceptual Image Quality Assessment Challenge dataset show the superior performance of RADN, and our ensemble approach won fourth place in the final testing phase of the challenge.
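Of the two modules, patch-level attention is the easier to sketch: each patch feature is updated as an attention-weighted mix of all patch features, so patches that earlier patch-based IQA methods scored independently now interact. This is a minimal NumPy illustration, not RADN's actual module or weights.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def patch_attention(patch_feats, d_k=None):
    """Minimal patch-level self-attention over patch-region features.

    patch_feats : (P, D) array, one feature vector per patch region
    Returns (P, D) features where each row mixes information from all patches.
    """
    d_k = d_k or patch_feats.shape[1]
    scores = patch_feats @ patch_feats.T / np.sqrt(d_k)   # (P, P) similarities
    return softmax(scores, axis=1) @ patch_feats          # attention-weighted mix

rng = np.random.default_rng(1)
feats = rng.normal(size=(6, 8))        # 6 patches, 8-dim features
out = patch_attention(feats)
print(out.shape)                       # (6, 8)
```

In a full IQA network the mixed features would feed a per-patch quality head; the point of the module is that a patch's score can now depend on its neighbours' content.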
ISBN (digital): 9798350365474
ISBN (print): 9798350365481
This report presents the ECO (Ensembled Clip score and cOnsensus score) pipeline from team DSBALAB, a new framework for evaluating and ranking captions for a given image. ECO selects the caption that most accurately describes the image by combining an Ensembled CLIP score, which measures the semantic alignment between the image and each caption, with a Consensus score, which accounts for how essential a caption is. Using this framework, we achieved notable success in the CVPR 2024 Workshop Challenge on Caption Re-ranking Evaluation at the New Frontiers for Zero-Shot Image Captioning Evaluation (NICE). Specifically, we secured third place on the CIDEr metric, second on both the SPICE and METEOR metrics, and first on ROUGE-L and all BLEU score metrics. The code and configuration for the ECO framework are available at https://***/DSBA-Lab/ECO.
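A minimal sketch of an ECO-style ranker, assuming precomputed embeddings: the ensembled CLIP score averages image-caption cosine similarity across CLIP variants, and the consensus score is taken as each caption's mean similarity to the other candidates. The consensus definition and the weight `lam` are our guesses, not the team's exact formulation.

```python
import numpy as np

def cosine(a, b):
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def eco_rank(image_embs, caption_embs, lam=0.5):
    """Rank captions best-first by ensembled CLIP score + lam * consensus.

    image_embs   : list of (D,) image embeddings, one per CLIP variant
    caption_embs : list of (N, D) caption embeddings, same variants
    """
    clip_scores = np.mean(
        [cosine(c, i[None]).ravel() for i, c in zip(image_embs, caption_embs)],
        axis=0)
    sims = cosine(caption_embs[0], caption_embs[0])
    np.fill_diagonal(sims, 0.0)
    consensus = sims.sum(1) / (len(sims) - 1)   # mean similarity to peers
    return np.argsort(-(clip_scores + lam * consensus))

img = [np.array([1.0, 0.0, 0.0])]               # one CLIP variant, toy embedding
caps = [np.array([[1.0, 0.0, 0.0],              # matches the image
                  [0.5, 0.5, 0.0],              # half-right
                  [0.0, 1.0, 0.0]])]            # unrelated
order = eco_rank(img, caps)
print(order)                                    # best caption first
```

The consensus term rewards captions that agree with the rest of the candidate pool, which suppresses outlier captions that happen to score well against the image alone.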
ISBN (digital): 9798350365474
ISBN (print): 9798350365481
Trajectory prediction, which aims to forecast future trajectories based on past ones, faces two pivotal issues: insufficient interactions and scene incompetence. The former signifies a lack of consideration for the interactions among agents' predicted future trajectories, potentially resulting in collisions, while the latter indicates an incapacity to learn complex social interactions from simple data. To establish an interaction-aware approach, we propose a diffusion-based model named TrajFine to extract social relationships among agents and refine predictions by considering past predictions and future interactive dynamics. Additionally, we introduce Scene Mixup, which augments training data by integrating agents from distinct scenes under a Curriculum Learning strategy that progressively increases task difficulty during training. Extensive experiments demonstrate the effectiveness of TrajFine for trajectory forecasting: it outperforms the current state of the art with significant improvements on the benchmarks.
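Scene Mixup with a curriculum can be sketched as below. The linear difficulty schedule and the function names are illustrative assumptions, not TrajFine's published code; the idea is just that a training scene borrows progressively more agents from another scene as epochs advance.

```python
import numpy as np

def scene_mixup(scene_a, scene_b, epoch, total_epochs, rng):
    """Augment scene_a with agents borrowed from scene_b, injecting more
    agents (i.e. denser social interaction) as training progresses.

    scene_a, scene_b : (num_agents, T, 2) past trajectories
    """
    frac = (epoch + 1) / total_epochs           # difficulty ramps up to 1
    n_extra = int(round(frac * len(scene_b)))
    if n_extra == 0:
        return scene_a
    idx = rng.choice(len(scene_b), n_extra, replace=False)
    return np.concatenate([scene_a, scene_b[idx]], axis=0)

rng = np.random.default_rng(0)
a = np.zeros((3, 8, 2))                          # 3 agents, 8 past steps
b = np.ones((4, 8, 2))                           # 4 agents from another scene
early = scene_mixup(a, b, epoch=0, total_epochs=10, rng=rng)
late = scene_mixup(a, b, epoch=9, total_epochs=10, rng=rng)
print(early.shape, late.shape)                   # few agents early, many late
```

Starting with sparse scenes and ending with crowded ones lets the model master simple motion before it must disentangle dense social interactions.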
ISBN (digital): 9798350365474
ISBN (print): 9798350365481
Video-based Question Answering (Video QA) is a challenging task that becomes even more intricate in Socially Intelligent Question Answering (SIQA). SIQA requires context understanding, temporal reasoning, and the integration of multimodal information, and in addition it requires processing nuanced human behavior. The complexities involved are further exacerbated by the dominance of the primary modality (text) over the others, so the task's secondary modalities need help to work in tandem with the primary one. In this work, we introduce a cross-modal alignment and subsequent representation fusion approach that achieves state-of-the-art results (82.06% accuracy) on the Social IQ 2.0 dataset for SIQA. Our approach better leverages the video modality by using the audio modality as a bridge to the language modality. This enhances performance by reducing the prevalent issues of language overfitting and the resulting bypassing of the video modality encountered by existing techniques. Our code and models are publicly available at [1].
ISBN (digital): 9798350365474
ISBN (print): 9798350365481
Image classifiers should be used with caution in the real world: performance evaluated on a validation set may not reflect performance in deployment. In particular, classifiers may perform well under conditions frequently encountered during training but poorly under other, infrequent conditions. In this study, we hypothesize that recent advances in text-to-image generative models make them valuable for benchmarking computer vision models such as image classifiers: they can generate images, conditioned on textual prompts, that cause classifier failures, allowing failure conditions to be described with textual attributes. However, their generation cost becomes an issue when a large number of synthetic images must be generated, which is the case when many different attribute combinations need to be tested. We propose an image classifier benchmarking method structured as an iterative process that alternates image generation, classifier evaluation, and attribute selection. This method efficiently explores the attribute space to find the conditions under which the classifier behaves poorly.
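The generate-evaluate-select loop can be sketched with stub models. Here `fake_generate` and `fake_classifier` stand in for the text-to-image model and the classifier under test, and the selection rule (keep the worst-scoring attribute combinations and extend them with one more attribute) is one plausible instantiation, not the paper's exact strategy.

```python
def benchmark_round(classifier, generate, combos, n_img=8, keep=2):
    """One generate -> evaluate -> select round: score each attribute
    combination by classifier accuracy on generated images, keep the worst."""
    acc = {c: sum(classifier(generate(c, i)) for i in range(n_img)) / n_img
           for c in combos}
    return sorted(acc, key=acc.get)[:keep]      # lowest accuracy survives

def benchmark(classifier, generate, attributes, rounds=2):
    """Iterate rounds, refining the worst combinations with extra attributes
    instead of exhaustively generating images for every combination."""
    combos = [(a,) for a in attributes]
    worst = combos
    for _ in range(rounds):
        worst = benchmark_round(classifier, generate, combos)
        combos = [c + (a,) for c in worst for a in attributes if a not in c]
        if not combos:
            break
    return worst

def fake_generate(attrs, seed):
    return attrs                                 # the "image" is its attributes

def fake_classifier(image):
    return 0 if "snow" in image else 1           # fails whenever it snows

failures = benchmark(fake_classifier, fake_generate, ["day", "snow", "night"])
print(failures)                                  # snow-based failure conditions
```

Only the surviving combinations trigger new generations in the next round, which is what keeps the synthetic-image budget manageable compared to testing every attribute combination.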
ISBN (digital): 9798350365474
ISBN (print): 9798350365481
In this work, we explore the potential of self-supervised learning with Generative Adversarial Networks (GANs) for electron microscopy datasets. We show how self-supervised pretraining facilitates efficient fine-tuning for a spectrum of downstream tasks, including semantic segmentation, denoising, noise & background removal, and super-resolution. Experimentation with varying model complexities and receptive field sizes reveals the remarkable phenomenon that fine-tuned models of lower complexity consistently outperform more complex models with random weight initialization. We demonstrate the versatility of self-supervised pretraining across various downstream tasks in the context of electron microscopy, allowing faster convergence and better performance. We conclude that self-supervised pretraining serves as a powerful catalyst, being especially advantageous when limited annotated data are available and efficient scaling of computational cost is important.
ISBN (digital): 9798350365474
ISBN (print): 9798350365481
Recently, there has been a significant amount of research on Multi-Camera People Tracking (MCPT). MCPT presents more challenges than Multi-Object Single-Camera Tracking, leading many existing studies to address it with offline methods. However, offline methods can only analyze pre-recorded videos, making them less practical in real industrial settings than online methods. We therefore focus on resolving the major problems that arise with the online approach. Specifically, to address problems that can critically affect online MCPT performance, such as storing inaccurate or low-quality appearance features and assigning multiple IDs to the same person, we propose a Cluster Self-Refinement module. We achieved third place in the 2024 AI City Challenge Track 1 with a HOTA score of 60.9261%, and our code is available at https://***/nota-github/AIC2024_Track1_Nota.
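The abstract leaves the module's internals open; one plausible reading of Cluster Self-Refinement combines quality filtering of stored appearance features with similarity-based ID merging. The toy sketch below follows that reading, with illustrative thresholds and data structures.

```python
import numpy as np

def refine_clusters(banks, quality, q_thresh=0.5, merge_thresh=0.9):
    """Illustrative Cluster Self-Refinement step:
      1) drop stored appearance features whose quality score is low, then
      2) merge track IDs whose mean appearance features are near-duplicates,
         countering the one-person-many-IDs failure of online MCPT.

    banks   : dict id -> (n_i, D) appearance features
    quality : dict id -> (n_i,) per-feature quality scores in [0, 1]
    """
    # 1) quality filtering of the appearance banks
    banks = {k: v[quality[k] >= q_thresh] for k, v in banks.items()
             if (quality[k] >= q_thresh).any()}
    # 2) ID merging by cosine similarity of mean features
    means = {k: banks[k].mean(0) / np.linalg.norm(banks[k].mean(0))
             for k in banks}
    merged = {}
    for k in sorted(banks):
        for kept in merged:
            if float(means[k] @ means[kept]) >= merge_thresh:
                merged[kept] = np.concatenate([merged[kept], banks[k]])
                break
        else:
            merged[k] = banks[k]
    return merged

# IDs 1 and 2 are the same person; ID 3 is someone else.
banks = {1: np.array([[1.0, 0.0], [0.99, 0.10]]),
         2: np.array([[1.0, 0.05]]),
         3: np.array([[0.0, 1.0]])}
quality = {1: np.array([1.0, 0.2]),     # second feature is low quality
           2: np.array([1.0]),
           3: np.array([1.0])}
refined = refine_clusters(banks, quality)
print(sorted(refined))                  # IDs 1 and 2 merged
```

Running both steps periodically keeps the online appearance banks clean enough for reliable cross-camera re-identification without waiting for the full video, as offline methods do.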
ISBN (digital): 9798350365474
ISBN (print): 9798350365481
Watching a video sequence in which a foreground person appears is no longer what it used to be. Deepfakes have changed the way we view such content, and we now routinely wonder whether what we are seeing is real or a fabrication. In this context of widespread disinformation, there is a pressing need for reliable tools that help users, expert or not, assess this kind of video sequence. In this paper, a novel approach that leverages temporal surface-frame anomalies to reveal deepfake videos is introduced. The method searches the surfaces of the captured scene, and their evolution along the temporal axis, for discrepancies induced by deepfake manipulation. These features are fed to a pipeline based on deep neural networks that performs a binary assessment of the video itself. Experimental results show that this methodology achieves significant detection accuracy.
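As a rough illustration of the idea (not the paper's actual feature extractor), one can describe each frame's "surface" by its intensity-gradient field and flag transitions where that field changes abruptly; on pristine video the signal varies smoothly, while a manipulated frame tends to produce spikes.

```python
import numpy as np

def surface_anomaly_signal(frames):
    """Illustrative temporal surface-anomaly signal.

    frames : (T, H, W) float array of grayscale frames
    Returns (T-1,) scores, one per frame-to-frame transition.
    """
    gy, gx = np.gradient(frames, axis=(1, 2))        # per-frame surface slopes
    d = np.sqrt(np.diff(gx, axis=0) ** 2 + np.diff(gy, axis=0) ** 2)
    return d.mean(axis=(1, 2))

# A smooth 5-frame sequence, with frame 3 replaced by noise ("manipulated").
rng = np.random.default_rng(2)
frames = np.stack([np.full((8, 8), t, float) for t in range(5)])
frames[3] = rng.normal(size=(8, 8))
sig = surface_anomaly_signal(frames)
print(sig.round(3))   # transitions into/out of frame 3 spike
```

In the paper's pipeline such per-transition features would be the input to a deep network that outputs the real/fake decision; the toy signal only shows why temporal surface discrepancies are informative.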