ISBN (Digital): 9798350365474
ISBN (Print): 9798350365481
Federated Class-Incremental Learning (FCIL) focuses on continually transferring previously learned knowledge to new classes in dynamic Federated Learning (FL). However, existing methods do not consider the trustworthiness of FCIL, i.e., improving continual utility, privacy, and efficiency simultaneously, which is strongly affected by catastrophic forgetting and data heterogeneity among clients. To address this issue, we propose FedProK (Federated Prototypical Feature Knowledge Transfer), which leverages prototypical features as a novel representation of knowledge to perform spatial-temporal knowledge transfer. Specifically, FedProK consists of two components: (1) a feature translation procedure on the client side that performs temporal knowledge transfer from the learned classes, and (2) prototypical knowledge fusion on the server side that performs spatial knowledge transfer among clients. Extensive experiments conducted in both synchronous and asynchronous settings demonstrate that FedProK outperforms other state-of-the-art methods from all three perspectives of trustworthiness, validating its effectiveness in selectively transferring spatial-temporal knowledge.
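As a concrete illustration of prototypical feature knowledge, the sketch below (assumed, not the authors' code; the function names are hypothetical) computes a mean feature vector per class on each client and fuses the per-class prototypes on the server by sample-count weighting.

```python
# Illustrative sketch: class prototypes as mean feature vectors per class on each
# client, fused on the server by sample-count weighting.
import torch

def compute_prototypes(features: torch.Tensor, labels: torch.Tensor) -> dict:
    """Mean feature vector per class observed on one client."""
    protos = {}
    for c in labels.unique().tolist():
        protos[c] = features[labels == c].mean(dim=0)
    return protos

def fuse_prototypes(client_protos: list, client_counts: list) -> dict:
    """Server-side fusion: weighted average of each class prototype across clients."""
    fused, weights = {}, {}
    for protos, counts in zip(client_protos, client_counts):
        for c, p in protos.items():
            w = counts[c]                      # number of samples of class c on this client
            fused[c] = fused.get(c, 0) + w * p
            weights[c] = weights.get(c, 0) + w
    return {c: fused[c] / weights[c] for c in fused}
```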
ISBN (Digital): 9798350365474
ISBN (Print): 9798350365481
Ambiguity is ubiquitous in human communication. Previous approaches in Human-Robot Interaction (HRI) have often relied on predefined interaction templates, leading to reduced performance in realistic and open-ended scenarios. To address these issues, we present a large-scale dataset, InViG, for interactive visual grounding under language ambiguity. Our dataset comprises over 520K images accompanied by open-ended, goal-oriented disambiguation dialogues, encompassing millions of object instances and corresponding question-answer pairs. Leveraging the InViG dataset, we conduct extensive studies and propose a set of baseline solutions for end-to-end interactive visual disambiguation and grounding, achieving a 45.6% success rate during validation. To the best of our knowledge, InViG is the first large-scale dataset for open-ended interactive visual grounding, presenting a practical yet highly challenging benchmark for ambiguity-aware HRI. Code and datasets are available at: https://***.
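To make the interactive disambiguation setting concrete, here is a hedged sketch of a possible control loop (the grounder, questioner, and answerer callables and the thresholds are illustrative assumptions, not the InViG baselines): the system keeps asking clarification questions until it is confident about a single target object or a turn budget is exhausted.

```python
# Hypothetical ask-until-confident loop for interactive visual disambiguation.
def interactive_grounding(image, instruction, grounder, questioner, answerer,
                          conf_threshold=0.8, max_turns=5):
    dialogue = [instruction]
    for _ in range(max_turns):
        box, confidence = grounder(image, dialogue)   # best candidate box + score
        if confidence >= conf_threshold:
            return box, dialogue                      # unambiguous enough: stop early
        question = questioner(image, dialogue)        # goal-oriented clarification question
        answer = answerer(image, question)            # human (or simulated) reply
        dialogue += [question, answer]
    return grounder(image, dialogue)[0], dialogue     # best guess after the turn budget
```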
ISBN (Digital): 9798350365474
ISBN (Print): 9798350365481
This paper presents a method to optimize Gaussian splatting with a limited number of images while avoiding overfitting. Representing a 3D scene by combining numerous Gaussian splats has yielded outstanding visual quality. However, it tends to overfit the training views when only a few images are available. To address this issue, we employ an adjusted depth map as a geometric reference, derived from a pre-trained monocular depth estimation model and subsequently aligned with the sparse structure-from-motion points. We regularize the optimization of 3D Gaussian splatting with the adjusted depth and an additional unsupervised smoothness constraint, thereby effectively reducing floating artifacts. Our method is mainly validated on the NeRF-LLFF dataset with varying numbers of images; we run multiple experiments with randomly selected training images and report the average to ensure fairness. Our approach produces more robust geometry than the original method, which relies solely on images.
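A minimal sketch of the two ingredients described above, under assumed shapes and names (not the paper's code): a per-image scale and shift that aligns the monocular depth map to the sparse SfM depths, and a depth regularization term combined with a simple smoothness penalty.

```python
# Align a monocular depth map to sparse SfM depths, then use it as a regularizer.
import torch
import torch.nn.functional as F

def align_depth(mono_depth, sfm_depth, sfm_mask):
    """Least-squares fit of scale a and shift b so that a*mono + b matches the SfM points."""
    d = mono_depth[sfm_mask]
    z = sfm_depth[sfm_mask]
    A = torch.stack([d, torch.ones_like(d)], dim=1)       # (N, 2)
    sol = torch.linalg.lstsq(A, z.unsqueeze(1)).solution  # (2, 1)
    a, b = sol[0, 0], sol[1, 0]
    return a * mono_depth + b

def depth_losses(rendered_depth, adjusted_depth, w_smooth=0.1):
    """L1 depth regularization plus a simple unsupervised smoothness term."""
    reg = F.l1_loss(rendered_depth, adjusted_depth)
    smooth = (rendered_depth[:, 1:] - rendered_depth[:, :-1]).abs().mean() + \
             (rendered_depth[1:, :] - rendered_depth[:-1, :]).abs().mean()
    return reg + w_smooth * smooth
```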
ISBN (Digital): 9798350365474
ISBN (Print): 9798350365481
Segment Anything Models (SAM) have made significant advancements in image segmentation, allowing users to segment target portions of an image with a single click (i.e., a user prompt). Given its broad applications, the robustness of SAM against adversarial attacks is a critical concern. While recent works have explored adversarial attacks against a pre-defined prompt/click, their threat model is not yet realistic: (1) they often assume the user-click position is known to the attacker (point-based attack), and (2) they often operate under a white-box setting with limited transferability. In this paper, we propose a more practical region-level attack in which the attacker does not need to know the precise user prompt. The attack remains effective no matter where the user clicks on the target object in the image, hiding the object from SAM. In addition, by adapting a spectrum transformation method, we make the attack more transferable under a black-box setting. Both controlled experiments and tests against real-world SAM services confirm its effectiveness.
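The region-level objective can be pictured with the hedged PGD-style sketch below (the sam_predict callable, step sizes, and loss are illustrative assumptions, not the paper's implementation): at every step a click is sampled anywhere inside the target region, and the perturbation is updated to suppress the mask predicted for that click, so the object stays hidden regardless of where the user actually clicks.

```python
# Region-level attack sketch: random click points inside the target region.
import torch

def region_attack(image, region_mask, sam_predict, eps=8/255, alpha=1/255, steps=100):
    delta = torch.zeros_like(image, requires_grad=True)
    ys, xs = torch.nonzero(region_mask, as_tuple=True)       # pixels of the target object
    for _ in range(steps):
        i = torch.randint(len(xs), (1,)).item()
        point = (xs[i].item(), ys[i].item())                  # random click in the region
        mask_logits = sam_predict(image + delta, point)       # assumed (H, W) mask logits
        loss = mask_logits[region_mask].mean()                # push logits down inside region
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)
            delta.grad.zero_()
    return (image + delta).clamp(0, 1).detach()
```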
ISBN (Digital): 9798350365474
ISBN (Print): 9798350365481
With recent advances in image and video diffusion models for content creation, a plethora of techniques have been proposed for customizing their generated content. In particular, manipulating the cross-attention layers of Text-to-Image (T2I) diffusion models has shown great promise in controlling the shape and location of objects in the scene. Transferring image-editing techniques to the video domain, however, is extremely challenging as object motion and temporal consistency are difficult to capture accurately. In this work, we take a first look at the role of cross-attention in Text-to-Video (T2V) diffusion models for zero-shot video editing. While one-shot models have shown potential in controlling motion and camera movement, we demonstrate zero-shot control over object shape, position and movement in T2V models. We show that despite the limitations of current T2V models, cross-attention guidance can be a promising approach for editing videos. Code: https://***/sam-motamed/***
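As a hedged sketch of cross-attention guidance (the attention shapes, token index, and spatial mask are assumptions, not the paper's method), the attention a chosen text token receives can be amplified inside a target region and suppressed elsewhere, steering where the corresponding object appears across frames.

```python
# Re-weight the cross-attention a target text token receives within a spatial mask.
import torch

def guide_cross_attention(attn, token_idx, spatial_mask, boost=2.0, suppress=0.1):
    """attn: (frames, h*w, tokens) attention probabilities; spatial_mask: (h*w,) in [0, 1]."""
    attn = attn.clone()
    scale = suppress + (boost - suppress) * spatial_mask      # high inside the mask, low outside
    attn[..., token_idx] = attn[..., token_idx] * scale
    return attn / attn.sum(dim=-1, keepdim=True)              # renormalize over text tokens
```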
ISBN (Digital): 9798350365474
ISBN (Print): 9798350365481
Traditional crop disease diagnosis, reliant on expert visual observation, is expensive, time-consuming, and prone to error. While Convolutional Neural Networks (CNNs) offer promising alternatives, their high resource demands limit their accessibility to farmers, particularly those in resource-constrained settings. Lightweight models that operate on resource-limited devices without network access are crucial to closing this gap. Quantization offers a promising approach for building such lightweight CNNs for crop disease detection, but the quality of quantized models often suffers. This paper proposes a Similarity-Preserving Quantization (SPQ) method that converts high-precision CNNs into lower-precision models while maintaining similar feature representations: SPQ ensures that similar crop images produce equivalent activation patterns in both the original and quantized models. Experimental evaluation using MobileNetV2 and ResNet-50 demonstrates that SPQ improves throughput, inference speed, and memory footprint by more than 3x while preserving detection performance.
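A minimal sketch of a similarity-preserving objective of this kind (assumed, not the paper's exact formulation): pairwise activation similarities within a batch from the quantized model are pushed to match those from the full-precision model.

```python
# Similarity-preserving loss between full-precision and quantized activations.
import torch
import torch.nn.functional as F

def similarity_matrix(feats: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between all pairs of samples in the batch."""
    f = F.normalize(feats.flatten(1), dim=1)   # (B, D)
    return f @ f.t()                           # (B, B)

def sp_loss(fp_feats: torch.Tensor, q_feats: torch.Tensor) -> torch.Tensor:
    """Match the quantized model's pairwise similarities to the full-precision ones."""
    return F.mse_loss(similarity_matrix(q_feats), similarity_matrix(fp_feats))
```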
ISBN (Digital): 9798350365474
ISBN (Print): 9798350365481
Recently, weakly supervised video anomaly detection (WS-VAD) has emerged as a contemporary research direction for identifying anomalous events, such as violence and nudity, in videos using only video-level labels. However, this task poses substantial challenges, including handling imbalanced modality information and consistently distinguishing normal from abnormal features. In this paper, we address these challenges and propose a multi-modal WS-VAD framework to accurately detect anomalies such as violence and nudity. Within the proposed framework, we introduce a new fusion mechanism, the Cross-modal Fusion Adapter (CFA), which dynamically selects and enhances audio-visual features that are highly relevant to the visual modality. Additionally, we introduce Hyperbolic Lorentzian Graph Attention (HLGAtt) to effectively capture the hierarchical relationships between normal and abnormal representations, thereby improving feature separation. Through extensive experiments, we demonstrate that the proposed model achieves state-of-the-art results on benchmark datasets for violence and nudity detection.
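To illustrate the idea of modality-aware fusion in isolation (a hedged sketch with assumed dimensions and names, not the paper's CFA), audio features can be gated by their learned relevance to the visual features of the same segment before the two streams are fused.

```python
# Gate audio features by their relevance to the visual stream, then fuse.
import torch
import torch.nn as nn

class CrossModalGate(nn.Module):
    def __init__(self, dim_v, dim_a, dim):
        super().__init__()
        self.proj_v = nn.Linear(dim_v, dim)
        self.proj_a = nn.Linear(dim_a, dim)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, v, a):                      # v: (T, dim_v), a: (T, dim_a) per-segment features
        v, a = self.proj_v(v), self.proj_a(a)
        g = self.gate(torch.cat([v, a], dim=-1))  # per-segment relevance of the audio stream
        return v + g * a                          # fused audio-visual representation
```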
ISBN (Digital): 9798350365474
ISBN (Print): 9798350365481
Audio-visual zero-shot learning methods commonly build on features extracted from pre-trained models, e.g. video or audio classification models. However, existing benchmarks predate the popularization of large multi-modal models, such as CLIP and CLAP. In this work, we explore such large pre-trained models to obtain features, i.e. CLIP for visual features and CLAP for audio features. Furthermore, the CLIP and CLAP text encoders provide class label embeddings which are combined to boost the performance of the system. We propose a simple yet effective model that only relies on feed-forward neural networks, exploiting the strong generalization capabilities of the new audio, visual and textual features. Our framework achieves state-of-the-art performance on VGGSound-GZSL^cls, UCF-GZSL^cls, and ActivityNet-GZSL^cls with our new features. Code and data available at: https://***/dkurzend/ClipClap-GZSL.
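Assuming precomputed CLIP visual, CLAP audio, and combined text-encoder class embeddings (a sketch of the general setup, not the authors' architecture), a purely feed-forward head could look like this: the concatenated audio-visual features are mapped into the class-embedding space and scored against every class by cosine similarity.

```python
# Feed-forward head over precomputed CLIP/CLAP features and class label embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AVZeroShotHead(nn.Module):
    def __init__(self, dim_v, dim_a, dim_cls, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim_v + dim_a, hidden), nn.ReLU(),
            nn.Linear(hidden, dim_cls),
        )

    def forward(self, clip_feat, clap_feat, class_embeds):
        x = self.mlp(torch.cat([clip_feat, clap_feat], dim=-1))  # (B, dim_cls)
        x = F.normalize(x, dim=-1)
        c = F.normalize(class_embeds, dim=-1)                    # (C, dim_cls) class embeddings
        return x @ c.t()                                         # (B, C) similarity logits
```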
ISBN (Digital): 9798350365474
ISBN (Print): 9798350365481
Video editing methods based on diffusion models that rely solely on a text prompt for the edit are hindered by the limited expressive power of text prompts. Incorporating a reference target image as a visual guide therefore becomes desirable for precise control over the edit. Moreover, most existing methods struggle to accurately edit a video when the shape and size of the object in the target image differ from those of the source object. To address these challenges, we propose "GenVideo", which edits videos by leveraging target-image-aware T2I models. Our approach handles edits with target objects of varying shapes and sizes while maintaining the temporal consistency of the edit using our novel target- and shape-aware InvEdit masks. Furthermore, we propose a novel target-image-aware latent noise correction strategy during inference to improve the temporal consistency of the edits. Experimental analyses indicate that GenVideo can effectively handle edits with objects of varying shapes where existing approaches fail.
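One way to picture a mask-guided latent correction of this flavor (a hedged sketch under assumed tensor shapes, not the GenVideo implementation) is to keep the source video's latents outside the edit mask at each denoising step, so unedited regions and their temporal behavior are preserved.

```python
# Blend edited and source latents using an edit mask at a given denoising step.
import torch

def correct_latents(edited_latents, source_latents, edit_mask):
    """edited/source latents: (frames, C, H, W); edit_mask: (frames, 1, H, W) in [0, 1]."""
    return edit_mask * edited_latents + (1.0 - edit_mask) * source_latents
```

In such a scheme, the correction would be applied after every denoising step so that only the masked region evolves toward the target while the rest of the video follows the source latents.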
ISBN (Digital): 9798350365474
ISBN (Print): 9798350365481
We study a limited-label problem and present a novel approach to Single-Positive Multi-label Learning. In the multi-label learning setting, a model learns to predict multiple labels or categories for a single input image. This contrasts with standard multi-class image classification, where the task is to predict a single label from many possible labels for an image. Single-Positive Multi-label Learning specifically considers learning to predict multiple labels when there is only one annotation per image in the training data. Multi-label learning is a more natural task than single-label learning because real-world data often involves instances belonging to multiple categories simultaneously; however, most computer vision datasets contain single labels due to the inherent complexity and cost of collecting multiple high-quality annotations per image. We propose a novel approach called Vision-Language Pseudo-Labeling (VLPL), which uses a vision-language model, CLIP, to suggest strong positive and negative pseudo-labels. Experimental results demonstrate the effectiveness of the proposed approach. Our code and data will be made publicly available at https://***/mvrl/VLPL.
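A hedged sketch of how such pseudo-labels could be harvested and used (thresholds and names are assumptions, not the paper's exact recipe): CLIP image-label similarities select strong positive and negative pseudo-labels, and entries in between are ignored when computing the multi-label loss.

```python
# Pseudo-label selection from CLIP similarities and a masked multi-label loss.
import torch
import torch.nn.functional as F

def pseudo_label_loss(logits, image_embeds, label_embeds, observed_pos,
                      pos_thresh=0.3, neg_thresh=0.1):
    """logits: (B, C) classifier outputs; embeds are L2-normalized CLIP features;
    observed_pos: (B, C) with a single 1 per row marking the known positive label."""
    sim = image_embeds @ label_embeds.t()              # (B, C) cosine similarities
    pos = (sim > pos_thresh) | observed_pos.bool()     # strong positives + known positive
    neg = sim < neg_thresh                             # strong negatives
    target = pos.float()
    keep = (pos | neg).float()                         # ignore uncertain labels
    loss = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    return (loss * keep).sum() / keep.sum().clamp(min=1.0)
```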